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1) SAGG-RIAC is an architecture for active learning of inverse models in high- 
dimensional redundant spaces 
c/3 2) This allows a robot to learn efficiently distributions of parameterized motor 

policies that solve a corresponding distribution of parameterized tasks 

3) Active sampling of parameterized tasks, called active goal exploration, can 
t-H be significantly faster than direct active sampling of parameterized policies 

4) Active developmental exploration, based on competence progress, autonomously 
drives the system to progressively explore tasks of increasing learning complex- 
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Abstract 



o 

C*"") We introduce the Self-Adaptive Goal Generation - Robust Intelligent 

Adaptive Curiosity (SAGG-RIAC) architecture as an intrinsically moti- 
vated goal exploration mechanism which allows active learning of inverse 

• i-H models in high-dimensional redundant robots. This allows a robot to efR- 

ciently and actively learn distributions of parameterized motor skills/policies 
?H that solve a corresponding distribution of parameterized tasks/goals. The 

architecture makes the robot sample actively novel parameterized tasks 
in the task space, based on a measure of competence progress, each of 
which triggers low-level goal-directed learning of the motor policy param- 
eters that allow to solve it. For both learning and generalization, the 
system leverages regression techniques which allow to infer the motor pol- 
icy parameters corresponding to a given novel parameterized task, and 
based on the previously learnt correspondences between policy and task 
parameters. 
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We present experiments with high-dimensional continuous sensorimo- 
tor spaces in three different robotic setups: 1) learning the inverse kine- 
matics in a highly-redundant robotic arm, 2) learning omnidirectional lo- 
comotion with motor primitives in a quadruped robot, 3) an arm learning 
to control a fishing rod with a flexible wire. We show that 1) exploration 
in the task space can be a lot faster than exploration in the actuator space 
for learning inverse models in redundant robots; 2) selecting goals maxi- 
mizing competence progress creates developmental trajectories driving the 
robot to progressively focus on tasks of increasing complexity and is sta- 
tistically significantly more efficient than selecting tasks randomly, as well 
as more efficient than different standard active motor babbling methods; 
3) this architecture allows the robot to actively discover which parts of its 
task space it can learn to reach and which part it cannot. 

Keywords: Active Learning, Competence Based Intrinsic Motivation, 
Curiosity-Driven Task Space Exploration, Inverse Models, Goal Babbling, 
Autonomous Motor Learning, Developmental Robotics, Motor Develop- 
ment. 

1 Motor Learning and Exploration of Forward 
and Inverse Models 

To operate robustly and adaptively in the real world, robots need to know how 
to predict the consequences of their actions (called here forward models, map- 
ping typically X = (S,w$), where S is the state of a robot and irg : S — > A 
is a parameterized action policy, to the space of effect, or task space, Y). Re- 
versely, they need to be able to compute the action policies that can generate 
given effects (called here inverse models, (S, Y) — > ng). These models can be 
quite varied, for example mapping joint angles to hand position in the visual 
field, oscillation of the legs to body translation, movement of the hand in the 
visual field to movement of the end point of a tool, or properties of a hand 
tap an object to the sound it produces. Some of these models can be analyt- 
ically elaborated by an engineer and provided to a robot (e.g. forward and 
inverse kinematics of a rigid body robot). But in many cases, this is impos- 
sible either because the physical properties of the body itself cannot be easily 
modeled (e.g. compliant bodies with soft materials), or because it is impossible 
to anticipate all possible objects the robot might interact with, and thus the 
properties of objects. More generally, it is impossible to model a priori all the 
possible effects a robot can produce on its environment, especially when robots 
are targeted to interact with in everyday human environments, such as in as- 
sistive robotics. As a consequence, learning these models through experience 
becomes necessary. Yet, this poses highly difficult technical challenges, due in 
particular to the combination of the following facts: 1) these models are often 
high-dimensional, continuous and highly non-stationary spatially, and some- 
times temporally; 2) learning examples have to be collected autonomously and 
incrementally by robots; 3) learning, as wc will detail below, can happen either 
through self-experimentation or observation, and both of these takes significant 
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physical time in the real world. Thus, the number of training examples that 
can be collected in a life-time is strongly limited with regards to the size and 
complexity of the spaces. Advanced statistical learning techniques dedicated to 
incremental high-dimensional regression have been elaborated recently, such as 
|107l IT2] . Yet, these regression mechanisms are efficient only if the quality and 
quantity of data is high enough, which is not the case when using unconstrained 
exploration such as random exploration. Fundamental complementary mecha- 
nisms for guiding and constraining autonomous exploration and data collection 
for learning are needed. 

In this article, we present a particular approach to address constrained ex- 
ploration and learning of inverse models in robots, based on an active learning 
process inspired by mechanisms of intrinsically motivated learning and explo- 
ration in humans. As we will explain, the approach studies the combination of 
two principles for learning efficiently inverse models in high-dimensional redun- 
dant continuous spaces: 

• Active goal/task exploration in a parameterized task space: The 

architecture makes the robot sample actively novel parameterized tasks 
in the task space, each of which triggers low-level goal-directed learning 
of the motor policy parameters that allow to solve it. This allows to 
leverage the redundancies of the sensorimotor mapping, leading the system 
to explore densely only subregions of the space of action policies that are 
enough to achieve all possible effects. Thus, it does not need to learn 
a complete forward model and contrasts with approaches that directly 
sample action policy parameters and observe their effects in the task space. 
The system also leverages regression techniques which allow to infer the 
motor policy parameters corresponding to a given novel parameterized 
task, and based on the previously learnt correspondences between policy 
and task parameters. 

• Interestingness as empirically evaluated competence progress: 

The measure of interestingness for a given goal/task is based on com- 
petence progress empirically evaluated, i.e. how previous attempts of 
low-level optimization directed at similar goals allowed to improve the 
capability of the robot to reach these goals. 

In the rest of the section, we review various related approaches to constrain- 
ing exploration for motor learning. 

1.1 Constraining the Exploration 

A common way to carry out exploration is to use a set of constraints on guid- 
ing mechanisms and maximally reduce the size and/or dimensionality of ex- 
plored spaces. Social guidance is an important source of such constraints, widely 
studied in robot learning by demonstration/imitation where an external human 
demonstrator assists the robot in its learning process [U [531 [TH1 [53J [551 [Ml [50] ■ 
Typically, a robot teacher manually interacts with the robot by showing it a few 
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behaviors corresponding to a desired movement or goal that it will then have to 
reproduce. This strategy prevents the robot from performing any autonomous 
exploration of its space and requires an attentive demonstrator. Some other 
techniques allow more freedom to the human teacher and the robot by allowing 
the robot to explore. This is typically what happens in the reinforcement learn- 
ing (RL) framework where no demonstration is originally required and only a 
goal has to be fixed (as a reward) by the engineer who conceives the system 
|114[ l84 l 1117] . Nevertheless, when the robot evolves in high-dimensional and 
large spaces, the exploration still has to be constrained. For instance, studies 
presented in |80j combine RL with the framework of learning by demonstration. 
In their experiments, an engineer has to first define a specific goal in a task space 
as a handcrafted reward function, then, a human demonstrator provides a few 
examples of successful motor policies to reach that goal, which is then used to 
initialize an optimization procedure. The Shifting Setpoint Algorithm (SSA) 
introduced by Schaal and Atkeson [52] proposes another way to constrain the 
exploration process. Once a goal fixed in an handcrafted manner, a progressive 
exploration process is proposed: the system explores the world gradually from 
the start position and toward the goal by creating a local model around the 
current position and shifting in direction of the goal once this model is reliable 
enough, and so on. These kinds of techniques therefore restrain the exploration 
to narrow tubes of data targeted at learning specific tasks/goals that have to 
be defined by a human, either the programmer or a non-engineer demonstrator. 

These methods are efficient and useful in many cases. Nevertheless, in a 
framework where one would like a robot to learn a variety of tasks inside un- 
prepared spaces like in developmental robotics |129| \77\ 11281 [5] , or more simply 
full inverse models (i.e. having a robot learn to generate in a controlled man- 
ner many effects rather than only a single goal), it is not conceivable that a 
human being interacts with a robot at each instant or that an engineer designs 
and tunes a specific reward function for each novel task to be learned. For 
this reason, it is necessary to introduce mechanisms driving the learning and 
exploration of robots in an autonomous manner. 

1.2 Driving Autonomous Exploration 

Active learning algorithms can be considered as organized and constrained self- 
exploration processes [HJ |3U1 [EH1 11051 15U] . In the regression setting, they are 
used to learn a regression mapping between an input space X and an output 
space Y while minimizing the sample complexity, i.e. with a minimal number 
of examples necessary to reach a given performance level. These methods, typi- 
cally beginning by random and sparse exploration, build meta-models of perfor- 
mances of the motor learning mechanisms and concurrently drive the exploration 
in various sub-spaces for which a notion of interest is defined, often consisting in 
variants of expected informational gain. A large diversity of criteria can be used 
to evaluate the utility of given sampling candidates, such as the maximization 
of prediction errors |118j , the local density of already queried points [131] , the 
maximization of the decrease of global model variance |30j . expected improve- 
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ment [5T], or maximal uncertainty of the model |119j among others. There have 
been active-extensions to most of the existent learning methods, e.g. logistic 
regression [93], support vector machines 103 , gaussian processes [53 | 156 ] 157]. 
Only very recently have these approaches been applied to robotic problems, and 
even more recently if we consider examples with real robots. Nevertheless exam- 
ples that consider robotic problems already exist for a large variety of problems: 
building environment maps |67l 1118j , reinforcement learning |12j , body schema 
learning [25], imitation (SUEZ], exploration of objects and body properties [77], 
manipulation [46], among many others. 

Another approach to exploration came from an initially different problem, 
that of understanding how robots could achieve cumulative and open-ended 
learning autonomously. This raised the question of the task-independent mech- 
anisms that may allow a robot to get interested in practicing skills and learn 
new tasks that were not specified at design time. Two communities of re- 
searchers, the first one in reinforcement learning |115l [9"6l IT2l UOOj . the second 
one in developmental robotics [3FJ HU [75] [77J US] , formalized, implemented and 
experimented several mechanisms based on the concept of intrinsic motivation 
(sometimes called curiosity-driven learning) grounded into theories of motiva- 
tion, spontaneous exploration, free play and development in humans [130, 35, 13 
as well as in recent findings in the neuroscience of motivation [1041 1521 153"] . 

As argumented in [95] Q21 [5] [55] , architectures based on intrinsically moti- 
vated learning can be conceptualized as active learning mechanisms which, in 
addition to allowing for the self-organized formation of behavioral and develop- 
mental complexity, can also also allow an agent to efficiently learn a model of the 
world by parsimoniously designing its own experiments/queries. Yet, in spite of 
these similarities between work in active learning and intrinsic motivation, these 
two strands of approaches often differ in their underlying assumptions and con- 
straints, leading to sometimes very different active learning algorithms. In many 
active learning models, one often assumes that it is possible to learn a model 
of the complete world within lifetime, and/or that the world is learnable every- 
where, and/or where noise is homogeneous everywhere. Given those assump- 
tions, heuristics based on the exploration of parts of the space where the learned 
model has maximal uncertainties or where its prediction are maximally wrong 
are often very efficient. Yet, these assumptions typically do not hold in real world 
robots in an unconstrained environment: the sensorimotor spaces, including the 
body dynamics and its interactions with the external world, are simply much too 
large to be learned entirely in a life time; there are typically subspaces which are 
unlearnable due to inadequate learning biases or unobservable variables; noise 
can be strongly homogeneous. Thus, different authors claimed that typical crite- 
ria used in traditional active learning approaches, such as the search for maximal 
uncertainty or prediction errors, might get trapped or become inefficient in sit- 
uations that are common in open-ended robotic environments [1001 1771 [5] 1101] . 
This is the reason why new active learning heuristics have been proposed in 
developmental robotics, such as those based on the psychological concept of in- 
trinsic motivations 90 , 35] [14] which relate to mechanisms that drive a learning 
agent to perform different activities for their own sake, without requiring any 
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external reward [SSI El EM EDI [SSI HH EH EZ1 HH El ESI [SH EDI- Different 
criteria were elaborated, such as the search for maximal reduction in empir- 
ically evaluated prediction error, maximal compression progress, or maximal 
competence progress [91)1 11001 [77] . For instance, the architecture called Robust- 
Intelligent Adaptive Curiosity (RIAC) [5], which is a refinement of the IAC 
architecture which was elaborated for open-ended learning of affordances and 
skills in real robots [77] , defines the interestingness of a sensorimotor subspace 
by the velocity of the decrease of the errors made by the robot when predicting 
the consequences of its actions, given a context, within this subspace. As shown 
in [77J [8] , it biases the system to explore subspaces of progressively increasing 
complexity. 

Nevertheless, RIAC and similar "knowledge based" approaches (see |74j ) 
have some limitations: first, while they can deal with the spatial or temporal 
non-stationarity of the model to be learned, they face the curse-of-dimensionality 
and can only be efficient when considering a moderate number of control dimen- 
sions (e.g. up to 9/10). Indeed, as many other active learning methods, RIAC 
needs a certain level of sampling density in order to extract and compare the 
interest of different areas of the space. Also, because performing these measure 
costs time, this approach becomes more and more inefficient as the dimension- 
ality of the control space grows [TS]. Second, they focus on the active choice 
of motor commands and measures of their consequences, which allows learning 
forward models that can be re-used as a side effect for achieving goals/tasks 
through online inversion: this approach is sub-optimal in many cases since it 
explores in the high-dimensional space of motor commands and consider the 
achievement of tasks only indirectly. 

A more efficient approach consists in directly actively exploring task spaces, 
which are also often much lower-dimensional, by actively self-generating goals 
within those task spaces, and then learn associated local coupled forward/inverse 
models that are useful to achieve those goals. Yet, as we will see, the process is 
not as straightforward as learning the forward model, since because of the space 
redundancy it is not possible to learn directly the inverse model (and this is 
the reason why learning the forward model and then only inversing it has often 
been achieved). In fact, exploring the task space will be used to learn a sub- 
part of the forward model that is enough for reaching most of reachable parts in 
the task space through local inversion and regression, leveraging techniques for 
generalizing policy parameters corresponding to novel task parameters based on 
previously learnt correspondences, such as in [2U1 [51 1551 1108] . 

1.3 Driving the Exploration at a Higher Level 

In a framework where a system should be able to learn to perform a maximal 
amount of different tasks (here this means achieving many goals/tasks in a pa- 
rameterized task space) before focusing on different ways to perform the same 
tasks (here this means finding several alternative actions to achieve the same 
goal), knowledge-based exploration techniques like RIAC cannot be efficient in 
robots with redundant forward models. Indeed, they typically direct a robotic 
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system to spend copious amounts of time exploring variations of action policies 
that produce the same effect, at the disadvantage of exploring other actions that 
might produce different outcomes, useful to achieve more tasks. An example 
of this is learning 10 ways to push a ball forward instead of learning to push a 
ball in 10 different directions. One way to address this issue is to take inspi- 
ration infant's motor exploration/babbling behavior, which has been argued to 
be teleological via introducing goals explicitly inside a task space and driving 
exploration at the level of these goals |126| 187] \124\ 1123] , Once a goal/task is 
chosen, the system would then try to reach it with a lower-level goal-reaching ar- 
chitecture typically based on coupled inverse and forward models, which might 
include a lower-level goal-directed active exploration mechanism. 

Two other developmental constraints, playing an important role in infant 
motor development, and presented in the experimentations of this paper, can 
also play an important role when considering such a task-level exploration pro- 
cess. First, we use motor synergies which have been shown as simplifying mo- 
tor learning by reducing the number of dimensions for control (nevertheless, 
even with motor synergies, the dimensionality of the control space can easily 
go over several dozens, and exploration still needs to be organized). These 
motor synergies are often encoded using Central Pattern Generators (CPG) 
[321 123 1231 ISH US] or as more traditional innate low-level control loops which 
are part of the innate structure allowing a robot to bootstrap the learning of 
new skills, as for example in [771 147] where it is combined with intrinsically 
motivated learning. Second, we will use a heuristic inspired by observations 
of infants who sometimes prepare their reaching movements by starting from 
a same rest position [17], by resetting the robot to such a rest position, which 
allows reducing the set of starting states used to perform a task. 

In this paper, we propose an approach which allows us to transpose some 
of the basic ideas of IAC and RIAC architectures, combined with ideas from 
the SSA algorithm, into a multi-level active learning architecture called Self- 
Adaptive Goal Generation RIAC algorithm (SAGG-RIAC) (an outline 
and initial evaluation of this architecture was presented in [5]). Unlike RIAC 
which was made for active learning of forward models mapping action policy 
parametes to effects in a task space, we show that this new algorithm allows 
for efficient learning of inverse models mapping parameters of tasks to param- 
eters of action policies that allow to achieve these tasks in redundant robots. 
This is achieved through active sampling of novel parameterized tasks in the 
task space, based on a measure of competence progress, each of which triggers 
low-level goal-directed learning of the motor policy parameters that allow to 
solve it. This takes advantage of both the typical redundancy of the mapping 
and of the fact that very often the dimensionality of the task space considered 
is much smaller than the dimensionality of motor primitives/ action parameter 
space. Such an architecture also leverages both techniques for optimizing action 
policy parameters for a single predefined tasks (e.g. [7^1 1112) ). as well as re- 
gression techniques allowing to infer the motor policy parameters corresponding 
to a given novel parameterized task, and based on the previously learnt corre- 
spondences between policy and task parameters (e.g. [20 ] 18] 155 ] I108J ) . While 
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approaches such as [7JH 1112] or [301 [SSJ 1108) do not consider the problem of 
autonomous life-long exploration of novel parameterized tasks, they are very 
complemetary to the present work as they could be used as the low-level tech- 
niques for low-level learning of action parameter policies for self-generated tasks 
in the SAGG-RIAC architecture. 

SAGG-RIAC can be considered as an active learning algorithm carrying out 
the concept of competence based intrinsically motivated learning |74j and is in 
line with concepts of mastery motivation, Flow, Optimal Level theories and 
zone of Proximal Development introduced in psychology [37j EJ H21 Efl I127j . In 
a competence based active exploration mechanism, according to the definition 
[7J], the robot is pushed to perform an active exploration in the goal/operational 
space as opposed to motor babbling in the actuator space. 

Several strands of previous research have began exploration various aspects 
of this family of mechanisms. First, algorithms achieving competence based 
exploration and allowing general computer programs to actively and adaptively 
self-generate abstract computational problems, or goals, of increasing complex- 
ity were studied in a theoretical computer science perspective [551 [Ml 1102] . 
While the high expressivity of these formalisms allows in principle to tackle 
a wide diversity of problems, they were not designed nor experimented for the 
particular family of problems of learning high-dimensional continuous models in 
robotics. While SAGG-RIAC also actively and adaptively self-generates goals, 
this is achieved with a formalism based on applied mathematics and dedicated 
to the problem of learning inverse models in continuous redundant spaces. 

Measures of interestingness based on a measure of competence to perform a 
skill were studied in [6j, as well as in [94] where a selector chooses to perform 
different skills depending on the temporal difference error to reach each skill. 
The study proposed in |lllj is based on the competence progress, which they use 
to select goals in a pre-specified set of skills considered in a discrete world. As 
we will show, SAGG-RIAC also uses competence progress, but targets learning 
in high-dimensional continuous robot spaces. 

A mechanism for passive exploration in the task space for learning inverse 
models in high-dimensional continuous robotics spaces was presented in [85, 86, , 
where a robot has to learn its arm inverse kinematics while trying to reach in 
a preset order goals put on a pre-specified grid informing the robot about the 
limits of its reachable space. In SAGG-RIAC exploration is actively driven in 
the task space, allowing the learning process to minimize its sample complexity, 
and as we will show, to reach a high-level of performances in generalization and 
to discover automatically its own limits of reachability. 

In the following sections we introduce the global architecture and formal- 
ization of the Self- Adaptive Goal- Generation SAGG-RIAC architecture. Then, 
we study experimentally its capabilities to allow a robot efficiently and actively 
learn distributions of parameterized motor skills/policies that solve a corre- 
sponding distribution of parameterized tasks/goals, and in the context of three 
experimental setups: 1) learning the inverse kinematics in a highly-redundant 
robotic arm, 2) learning omnidirectional locomotion with motor primitives in 
a quadruped robot, 3) an arm learning to control a fishing rod with a flexible 
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wire. More precisely, we focus on the following aspects and contributions of the 
architecture: 

• SAGG-RIAC creates developmental trajectories driving the robot to pro- 
gressively focus on tasks of increasing complexity of learnability; 

• Drives the learning of a high variety of parameterized tasks (i.e. capability 
to reach various regions of the goal/task space) instead of numerous ways 
to perform the same task; 

• Allows learning fields of tasks in high-dimensional high-volume control 
spaces as long as the task space is low-dimensional (it can be high- volume) ; 

• Allows learning in task-spaces where only small and initially unknown 
subparts are reachable; 

• Drives the learning of inverse models of highly-redundant robots with dif- 
ferent body schemas; 

• Guides the self-discovery of the limits of what the robot can achieve in its 
task space; 

• Allows improving significantly the quality of learned inverse models in 
terms of speed of learning and generalization performance to reach goals in 
the task space, compared to different methods proposed in the literature; 



2 Competence Based Intrinsic Motivation: The 
Self- Adaptive Goal Generation RIAC Archi- 
tecture 

2.1 Global Architecture 

Let us consider the definition of competence based models outlined in [71] , and 
extract from it two different levels for active learning defined at different time 
scales (Fig. [I]): 

1. The higher level of active learning (higher time scale) takes care of the 
active self-generation and self-selection of goals/tasks in a parameterized 
task space, depending on a measure of interest based on the level of com- 
petences to reach previously generated goals (e.g. competence progress); 

2. The lower level of active learning (lower time scale) considers the goal- 
directed active choice and active exploration of lower-level actions to be 
taken to reach the goals selected at the higher level, and depending on 
local measures of interest related to the evolution of the quality of learned 
inverse and/or forward models; 
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Figure 1: Global Architecture of the SAGG-RIAC architecture. The structure 
is comprised of two parts defining two levels of active learning: a higher level 
which considers the active self-generation and self-selection of goals, and a lower 
level, which considers the goal-directed active choice and active exploration of 
low-level actions, in order to reach the goals selected at the higher level. 



2.2 Model Formalization 

Let us consider a robotic system described in both a state/context space S, 
and a task space Y which is a field of parameterized tasks/goals that can be 
viewed as defining a field of parameterized reinforcement learning problems. For 
a given context s G S, a sequence of actions a = {ai, 02, a n } £ A, potentially 
generated by a parameterized motor synergy irg : S — > A (alternatively called 
an "option" and including a self-termination mechanism), allows a transition 
toward the new states y <E Y such that (s,a) — > y, also written (s,ire) — > 
y. For instance, in the first experiment introduced in the following sections 
where we use a robotic manipulator, S represents its actuator/joint space, Y 
the operational space corresponding to the cartesian position of its end-effector, 
and A relates to velocity commands in the joints. Also, in the second experiment 
involving a quadruped where we use motor synergies, the context s is always 
reset to a same state and has thus no influence on the learning, A relates to the 
24 dimensional parameters of a motor synergy which considers the frequency 
and amplitude of sinusoids controlling the position of each joints over time, and 
Y relates to the position and orientation of the robot after the execution of the 
synergy during a fixed amount of time. 

SAGG-RIAC drives the exploration and learning of how to reach goals given 
starting contexts/states. Starting states are formalized as configurations s e S 
and goals as a desired y g G Y '. All states are considered to be potential starting 
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states; therefore, once a goal has been generated, the low-level goal directed 
exploration and learning mechanism always tries to reach it by starting from 
the current state of the system as formalized and explained below. 

When the initiation position s star t, the goal y g and constraints p (e.g. linked 
with the spent energy) are chosen, it generates a motor policy irg(Data)( s start, yg, p) 
parameterized by s start, Vg and p as well as parameters 9 of internal forward 
and inverse models already learned with previously acquired data Data. Also, it 
is important to notice that i^e(Data){sstart, yg, p) can be computed on the fly, as 
in the experiments below, with regression techniques allowing to infer the motor 
policy parameters corresponding to a given novel parameterized task, and based 
on the previously learnt correspondences between policy and task parameters, 
such as in [201 151 1551 HQS] . 

We can make an analogy of this formalization with the Semi-Markov Option 
framework introduced by Sutton [116] . In the case of SAGG-RIAC, when con- 
sidering an option (I,ir,f3), we can first define the initiation set / : S — >• [0; 1], 
where / is true everywhere, because, as presented before, every state can here 
be considered as a starting state. Also, goals are related to the terminal con- 
dition (3 and j3 = 1 when the goal is reached, and the policy ir encodes the 
skill learned through the process induced by the lower-level of active learning 
and shall be indexed by the goal y g , i.e. n y . More formally, as induced by 
the use of semi-markov options, we define policies and termination conditions 
as dependent on all events between the initiation of the option, and the current 
instant. This means that the policy 7r, and /3 are depending on the history 
ht T = {st, at, St+i, a t+ i..., s T } where t is the initiation time of the option, and t, 
the time of the latest event. Denoting the set of all histories by f2, the policy and 
termination condition become defined by ir : 51 x A — > [0; 1] and (3 : — > [0; 1]. 

Moreover, because we have to consider cases where goals are not reachable 
(either because of physical impossibility or because the robot is not capable of 
doing it at that point of its development), we need to define a timeout t max 
which can stop a goal reaching attempt once a maximal number of actions has 
been executed. h tT is thus needed to stop 7r, (i.e. the low-level active learning 
process), if r > t max . 

Eventually, using the framework of options, we can define the process of 
goal self-generation, as the self-generation and self-selection of parameterized 
options, and a goal reaching attempt corresponding to the learning of a par- 
ticular option. Therefore, the global SAGG-RIAC process can be described as 
exploring and learning fields of options. 

2.3 Lower Time Scale: 

Active Goal Directed Exploration and Learning 

In SAGG-RIAC, once a goal has been actively chosen at the high-level, the goal 
directed exploration and learning mechanism at the lower can be carried out in 
numerous ways: the architecture makes only little assumptions about them, and 
thus is compatible with many methods such as those described below (this is 
the reason why SAGG-RIAC is an architecture defining a family of algorithms). 
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Its main idea is to guide the system toward the goal by executing low-level 
actions which allow a progressive exploration of the world toward this specific 
goal and that updates at the same time the local corresponding forward and 
inverse models, leveraging previously learnt correspondences with regression. 
The main assumptions about the methods that can be used for this lower level 
are: 

• Incremental learning and generalization: based on the data collected 
incrementally, the method must be able to build incrementally local for- 
ward and inverse models that can be reused later on, in particular when 
considering other goals, such as the task-space regression techniques pre- 
sented in [27fll£ll551ITn%]: 

• Goal-directed optimization: when a goal is set, an optimization pro- 
cedure can improve the parameters of the action policy to reach the goal, 
such as policy gradient methods [75J Hl^j or stochastic optimization [35] ; 

A optional feature, which is a variant of the second assumption above, is: 

• Active optimization: goal-directed optimization of the parameters of 
the action policy for reaching a self-generated goal. A learning feedback 
mechanism has to be added such that the exploration is active, and the 
selection of new actions depends on local measures about the quality of 
the learned model. 

In the following experiments that will be introduced, we will use two different 
methods: one mechanism where optimization is inspired by the SSA algorithm 
[92], coupled with memory-based local forward and inverse regression models 
using local Moore-Penrose pseudo-inverses, and a more generic optimization al- 
gorithm mixing stochastic optimization with memory-based regression models 
using pseudo-inverse. Other kinds of techniques could be used. For the op- 
timization part, algorithms such as natural actor-critic architectures in model 
based reinforcement learning [75] , algorithms of convex optimization [33] , algo- 
rithms of stochastic optimization like CMA (e.g. [45]), or path-integral methods 

(e.g. mm)- 

For the regression part, we are here using a memory-based approach, which 
if combined with efficient data storage and access structures [HEO], scales well 
from a computational point of view. Yet, if memory limits would be a lim- 
ited resource, and as little assumption about the low-level regression algorithms 
are made in the SAGG-RIAC architecture, parameterized models allowing to 
control memory requirements such as Neural networks, Support Vector Regres- 
sion, Gaussian Process Regression could instead be considered [107j, such as in 
[2771 151 1551 ITU5]. 

2.4 Higher Time Scale: 

Goal Self-Generation and Self-Selection 

The Goal Self-Generation and Self-Selection process relies on a feedback de- 
fined using the concept of competence, and more precisely on the competence 
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improvement in given regions (or subspaces) of the task space where goals are 
chosen. The measure of competence can be computed at different instants of the 
learning process. First, it can be estimated once a reaching attempt in direction 
of a goal has been declared as terminated. Second, for robotic setups which 
are compatible with this option, competence can be computed during low-level 
reaching attempts. In the following sections, we detail these two different cases: 

2.4.1 Measure of Competence for a Terminated Reaching Attempt 

A reaching attempt for a goal is considered terminated according to two condi- 
tions: 

• A timeout related to a maximum number of iterations allowed by the 
low-level of active learning has been exceeded. 

• The goal has effectively been reached. 

We introduce a measure of competence for a given goal reaching attempt as 
dependent on two metrics: the similarity between the point in the task space 
yf attained when the reaching attempt has terminated, and the actual goal y g ; 
and the respect of constraints p. These conditions are represented by a cost, or 
competence, function C defined in [—00; 0], such that higher C(y g , yf, p) will be, 
the more a reaching attempt will be considered as efficient. From this definition, 
we set a measure of competence r yg directly linked with the value of C(y g , y/,p): 



where e sim is a tolerance factor such that C(y g ,yf, p) > e sim corresponds 
to a goal reached. We note that a high value of Y y (i.e. close to 0) represents 
a system that is competent to reach the goal y g while respecting constraints p. 
A typical instantiation of C, without constraints p, is defined as C(j/ g ,y/,0) = 
— \\y g — Vf\\ 2 , and is the direct transposition of prediction error in RIAC (TTj [S] to 
the task space in SAGG-RIAC. Yet, this competence measure might take some 
other forms in the SAGG-RIAC architecture, such as the variants explored in 
the experiments below. 

2.4.2 Measure of Competence During a Reaching Attempt or Dur- 
ing Goal-Directed Optimization 

When the system exploits its previously learnt models to reach a goal y g , us- 
ing a computed 7rg through adequate local regression, or when it is using the 
low-level goal-directed optimization to optimize the best current 7r# to reach 
a self-generated goal y g , it does not only collect data allowing to measure its 
competence to reach y g , but since the computed ng might lead to a different 
effect y e ^ y g , it also allows to collect new data for improving the inverse model 
and the measure of competence to reach other goals in the locality of y e . This 
allows to use all experiments of the robot to update the model of competences 
over the space of paremeterized goals. 




C(y g ,yf,p) if C(y g ,y f ,p) < e slm < 



otherwise 
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2.4.3 Definition of Local Competence Progress 

The active goal self-generation and self-selection relies on a feedback linked 
with the notion of competence introduced above, and more precisely on the 
monitoring of the progress of local competences. We first need to define this 
notion of local competence. Let us consider a subspace called a region R C 
Y. Then, let us consider different measures of competence T yi computed for 
different attempted goals yi G R, in a time window consisting of the C last 
attempted goals. For the region R, we can compute a measure of competence 
F that we call a local measure such that: 

{ \R\ 

with \R\, cardinal of R. 

Let us now consider different regions Ri of Y such that Ri C Y, |L Ri = 
Y (initially, there is only one region which is then progressively and recur- 
sively split; see below and see Fig. |2|. Each Ri contains attempted goals 
{HhM ,yi 2 ,t 2 ,-, Vi k ,t k }Ri and corresponding competences obtained {T ViuH , T Vi2 ^ Ty^^ } Ri , 
indexed by their relative time order of experimentation ti < t 2 < ••■ < ifc|in+i = 
t n + 1 inside this precise subspace Ri (ti are not the absolute time, but integer 
indexes of relative order in the given region). 

An estimation of interest is computed for each region Ri. The interest 
inter esti of a region Ri is described as the absolute value of the derivative of local 
competences inside Ri, hence the amplitude of local competence progress, over 
a sliding time window of the £ more recent goals attempted inside Ri (equation 




interesti 





c 



(2) 



By using a derivative, the interest considers the variation of competences, and 
by using an absolute value, it considers cases of increasing and decreasing com- 
petences. In SAGG-RIAC, we will use the term competence progress with its 
general meaning to denote this increase and decrease of competences. 

An increasing competence signifies that the expected competence gain in 
Ri is important. Therefore, potentially, selecting new goals in regions of high 
competence progress could bring both a high information gain for the learned 
model, and also drive the reaching of not previously achieved goals. 

Depending on the starting position and potential evolution of the environ- 
ment or of the body (e.g. breaking of a body part), a decrease of competences 
inside already well-reached regions can arise. In this case, the system should be 
able to focus again in these regions in order to at least verify the possibility to 
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Goal Space 




Figure 2: Task space and example of regions and subregions split during the 
learning process according to the competence level. Each region displays its 
competence level over time, measure which is used for the computation of the 
interest according to equation [2] 

re-establish a high level of competence inside. This explains the usefulness to 
consider the absolute value of the competence progress as shown in equation [2j 
Using a sliding window in order to compute the value of interest prevents 
the system from keeping each measure of competence in its memory, and thus 
limits the storage resource needed by the core of the SAGG-RIAC architecture. 

2.4.4 Goal Self- Generation Using the Measure of Interest 

Using the previous description of interest, the goal self-generation and self- 
selection mechanism carries out two different processes: 

1. Splitting of the space Y where goals are chosen, into subspaces, according 
to heuristics that allows to maximally discriminate areas according to their 
levels of interest. 

2. Selecting the next goal to perform. 

Such a mechanism has been described in the RIAC algorithm introduced in [5] , 
but was previously applied to the actuator space S rather than to the goal/task 
space Y as is done in SAGG-RIAC. Here, we use the same kind of methods 
such as a recursive split of the space, each split being triggered once a prede- 
fined maximum number of goals g max has been attempted inside. Each split 
is performed such that it maximizes the difference of the interest measure de- 
scribed above in the two resulting subspaces. This allows the easy separation 
of areas of differing interest and therefore of differing reaching difficulty. More 
precisely, here the split of a region R n into R n +i and R n +2 is done by select- 
ing among m randomly generated splits, a split dimension j £ \Y\ and then a 
position Vj such that: 

• All the yi of R n +i have a j th component smaller than Vj\ 

• All the yi of R n +2 have a j th component higher than vj; 
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• The quantity Qual(j, vj) — card(R n +i)xard(R n +2)-\\interestR n+1 —interestii n+2 \\ 
is maximal; 

Finally, as soon as at least two regions exist after an initial random explo- 
ration of the whole space, goals are chosen according to the following heuristics, 
selected according to probabilistic distributions: 

1. mode(l): inpi% percent (typically p\ = 70%) of goal selections, a random 
goal is chosen along a uniform distribution inside a region which is selected with 
a probability proportional to its interest value: 

interest n — vaxTaiinteresti) 
/-Ji=i mteresti — mm{mteresti) 

Where P n is the selection probability of the region R n , and inter esti corresponds 
to the current interest of the region 

2. mode{2): in p 2 % (typically P2 = 20% of cases), a random goal is chosen 
inside the whole space Y. 

3. mode(3): in p 3 % (typically p 3 = 10%), a region is first selected according to 
the interest value (like in mode(l)) and then a new goal is generated close to 
the already experimented one which received the lowest competence estimation. 

2.4.5 Reduction of the Initiation Set 

In order to improve the quality of the learned inverse model, we add a heuristic 
inspired by two observations on infant motor exploration and learning. The first 
one, proposed by Berthier et al. [T7] is that infant's reaching attempts are often 
preceded by movements that either elevate their hand or move their hand back 
to their side. And the second one, noticed in [85], is that infants do not try to 
reach for objects forever but sometimes relax their muscles and rest. Practically, 
these two characteristics allow them to reduce the number of initiation positions 
that they use to reach an object, which simplifies the reaching problem by letting 
them learn a reduced number of reaching movements. 

Such mechanism can be transposed in robotics to motor learning of arm 
reaching tasks as well as other kind of skills such as locomotion or fishing as 
shown in experiments below. In such a framework, it directly allows a highly- 
redundant robotic system to reduce the space of initiation states used to learn to 
reach goals, and also typically prevent it from experimenting with too complex 
actuator configurations. We add such a process in SAGG-RIAC, by specifying 
a rest position (s res t,y r est) reachable without any need of planning from the 
system, that is set for each r subsequent reaching attempts (we call r the reset 
value, with r > 0). 

2.5 New Challenges of Unknown Limits of the Task Space 

In traditional active learning methods and especially knowledge-based intrinsi- 
cally motivated exploration [TTJ |77[ EH |96j |2T] , the system is typically designed 
to select actions to perform inside a set of values inside an already known interval 
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(for instance, the range of angles that can be taken by a motor, or the phases 
and amplitudes of CPGs which can be easily identified). In these cases, the 
challenge is to select which areas would potentially give the most information 
to the system, to improve its knowledge, inside this fixed range of possibilities. 
As argued earlier, a limit of these approaches is that they become less and less 
efficient as the dimensionality of the control space increases. Competence based 
approaches allow to address this issue when a low-dimensional task space can 
be identified. Nevertheless, in that case, a new problem arises when consider- 
ing unbounded learning: the space where goals are reachable can be extremely 
large and it is generally very difficult to predict its limits and undesirable to 
ask the engineer to identify them. Therefore, when carried out in large spaces 
where the reachable area is only a small part of it, the algorithm could neces- 
sitate numerous random goal self-generations to be able to estimate interests 
of different subregions. In order to reduce this number, and help the system 
to converge easily toward regions where competence can be improved, we em- 
phasize two different mechanisms that can be used in SAGG-RIAC, during a 
reaching attempt: 

1. Conservation of every point reached inside the task space even if they do 



not correspond to the attempted goal (see section 2.4.2): when the robot 
performs a reaching attempt toward a goal y, and, instead of reaching 
it, terminates at another state y' ', we consider y' as a goal reached with 
a value of competence depending on constraints p. In cases where no 
constraints are studied, we can consider the y' as another goal reached 
with the highest level of competence. 

2. Addition of subgoals: in robotic setups where the process of goal reaching 
can be subdivided and described using subgoals which could be fixed on 
the pathway toward the goal, we artificially add states z/i, 7/2, 2M that 
have to be reached before y while also respecting the constraints p, and 
estimate a competence measure for each one. 

The consideration of these two heuristics has important advantages: first, they 
can significantly increase the number of estimations of competence, and thus 
the quantity of feedback returned to the goal self-generation mechanism. This 
reduces the number of goals that have to be self-generated to bootstrap the sys- 
tem, and thus the number of low-level iteration required to extract first inter- 
esting subspaces. Also, by creating areas of different competence values around 
already reached states, they influence the discovery of reachable areas. Finally, 
they result in an interesting emergent phenomena: they create a growing area 
of increasing competence around the first discovered reachable areas. Indeed, 
by obtaining values of competences inside reachable areas, the algorithm is able 
to split the space first in these regions, and compute values of interest. These 
values of interest will typically be high in already reached areas and influence 
the goal self-generation process to create new goals in its proximity. Once the 
level of competence becomes important and stabilized in efficiently reached ar- 
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eas, the interest becomes null, then, new areas of interest close to these ones 
will be discovered, and so on. 

2.6 Pseudocode 

Pseudo-code [l] and algorithm [2] present the flow of operations in the SAGG- 
RIAC architecture. Algorithms [3] and [4] are simple alternative examples of low- 
level goal-directed optimization algorithms that are used in the experimental 
section, but they could be replaced by other algorithms like PI 2 — CM A [112] , 
CM A [3S]j or those presented in (75]. The function Inefficient can also be built 
in numerous manners and will not be described in details in the pseudo-code 
(examples will be described then for each experimentation). Its function is to 
judge if the current model has been efficient enough to reach or come closer to 
the decided goal, or if the model has to be improved in order to reach it. 

In the following sections, we will present two different kinds of experiments. 
The first one is a reaching experiment where a robotic arm has to learn its 
inverse kinematics to reach self-generated end-effector positions. It uses an 
evolving context s € S, also called setpoint in SSA, representing its current 
joint configuration. Therefore, it can be described by the relationship (s, a) — > y 
where s, a and y can evolve. It is thus possible to use a goal-directed optimization 
algorithm very similar to SSA in this experiment, like the one in algorithm [3] 

In the two other experiments, in contrast, we control the robots using pa- 
rameterized motor synergies and consider a fixed context (a rest position) seS 
where the robot is reset before each action: we will first consider a quadruped 
learning omnidirectional locomotion, and then an arm controlling a flexible fish- 
ing rod learning to put the float in precise self-generated positions on top of the 
water. Thus, these systems can be described by the relationship (s,irg) — > y, 
where s will here be fixed and 8 will be the parameters of the motor synergy 
used to control the robots. Thus, a variation of setpoint being prevented here, 
a variant of SSA will be proposed for such experiments (similar to a more tra- 
ditional optimization algorithm), where the context will not evolve and always 
be reset, like in algorithm [3| 

3 Experimental Setup 1: Learning Inverse Kine- 
matics with a Redundant Arm 

In this section, we propose an experiment carried out with a robotic arm which 
has to explore and learn its forward and inverse kinematics. Also, before 
discussing the details of our active exploration approach in this first experi- 
mentation case, we firstly define the representations of the models and control 
paradigms involved in this experiment. Here, we focus on robotic systems whose 
actuators are settable by positions and velocities, and restrict our analysis to 
discrete time models. 

Allowing robots to be self-adaptive to environmental conditions and changes 
in their own geometry is an important challenge of machine learning. These 
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changes in the robot geometry directly have an impact on its Inverse Kinemat- 
ics IK, relating workspace coordinates (where tasks are usually specified), to 
actuators coordinates (like joint position, velocity, or torque used to command 
the robot). Learning inverse kinematics is useful in numerous machine learning 
cases, such as when no accurate kinematic model of a robot is available or when 
an online calibration is needed due to sensor or motor imprecision. Moreover, in 
developmental robotics studies, the a priori knowledge of a precise model of the 
body is often avoided, because of its implausibility from the biological point of 
view. In the following experiment, we assume that the inverse kinematics of our 
system is totally unknown, and we are interested in studying how SAGG-RIAC 
can efficiently guide the discovery and learning of its inverse kinematics. 

3.1 Control Paradigms for Learning Inverse Kinematics 

Let us mathematically formulate forward and inverse kinematics relations. We 
define the intrinsic coordinates (joint/actuator positions) of a manipulator as 
the n-dimensional vector S = a £ WL n , and the position and orientation of the 
manipulator's end-effector as the m-dimensional vector y £ M™. Relative to 
this formalization, actions of the robot corresponds to speed commands param- 
eterized by a vector 8 = a € R™ which controls the instantaneous speed of each 
of the n joints of the arm. The forward kinematic function of this system is 
generally written as y = f(a), and inverse kinematics relationship is defined as 

When a redundant manipulator is considered (n > m), or when m = n, 
solutions to the inverse relationship are generally non-unique [106 . The problem 
posed to inverse learning algorithms is thus to determine particular solutions to 
a = ,f~ 1 (y), when multiple solutions exists. A typical approach used for solving 
this problem considers local methods, which learn relationships linking small 
changes A a and Ay : 

y = J(a)a (4) 

where J(a) is the Jacobian matrix. 

Then, using the Jacobian matrix and inverting it to get a single solution a 
corresponding to a desired y raises the problem of the non-convexity property 
of this last equation. A solution to this non-convex problem has then been 
proposed by Bullock in [53] who converted it into a convex problem, by only 
considering the learning task within the spatial vicinity a of a particular a : 

IS = J{a)a (5) 

3.2 Representation of Forward and Inverse Models to be 
Learnt 

We use here non-parametric models which typically determine local models in 
the vicinity of a current datapoint. By computing a model using parameterized 
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Figure 3: Values used to compute the competence T y , considering a manipu- 
lator of 7 degrees-of-freedom, in a 2 dimensions operational/task space. Here, 
the arm is set in a position called rest position which is not straight and slightly 
bent. (a rest ,y rest ). 

functions on datapoints restrained to a locality, they have been proposed as 
useful for real time queries, and incremental learning. Learning inverse kine- 
matics typically deals with these kind of constraints, and these local methods 
have thus been proposed as an efficient approach to IK learning [1251 1107] . In 
the following study, we use an incremental version of the Approximate Near- 
est Neighbors algorithm (ANN) [70], based on a tree split using the k-means 
process, to determine the vicinity of the current a. Also, in the environments 
that we use to introduce our contribution, we do not need highly robust, and 
computationally very complex regression methods. Using the pseudo-inverse of 
Moore-Penrose [2] to compute the pseudo-inverse J + (a) of the Jacobian J (a) in 
a vicinity a is thus sufficient. Possible problems happening due to singularities 
[1061 l2"51lin] being bypassed by adding noise in the joint configurations (see [55] 
for a study about this problem). 

Also, in the following equation, we use this method to deduce the change 
Aa corresponding to a Aa;, for a given joint position a: 

a = J+(a)y (6) 

3.3 Robotic Setup 

In the following experiments, we consider a n-dimensional manipulator con- 
trolled in position and speed (as many of today's robots), updated at discrete 
time values. The vector a £ K" which represents joint angles corresponds to the 
context/state space S and the vector y £ K m which is the position of the ma- 
nipulator's end-effector in m dimensions in the Euclidian space K m corresponds 
to the task space Y (see Fig. [3] where n — 7 and m = 2). We evaluate how 
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the SAGG-RIAC architecture can be used by a robot to learn how to reach all 
reachable points in the environment Y with this arm's end-effector. Learning the 
inverse kinematics is here an online process that arises each time a micro-action 
9 = Aa e A is executed by the manipulator: by doing each micro-action, the 
robot stores measures (a, Aa, Ax) in its memory and creates a database Data 
which contains elements (a,, Aai, Aj/j) representing the discovered change Ay^ 
corresponding to a given Ao^ in the configuration on (this learning entity can 
be called a schema according to the terminology of Drescher [38]). These mea- 
sures are then reused online to compute the Jacobian J (a) — Ay/Aa locally 
to move the end-effector in a desired direction Ayg es i re d fixed toward the self- 
generated goal. Therefore, we consider a learning problem of 2n dimensions, 
the relationship that the system has to learn being (a,Aa) —> Ay. Also, in 
this experiment, where we suppose Y Euclidian, and do not consider obstacles, 
the direction to a goal can be defined as following a straight line between the 
current end-effector's position and the goal. 



3.4 Evaluation of Competence 

In this experiment, in order to clearly illustrate the main contribution of our 
algorithm, we do not consider constraints p and only focus on the reaching of 
goal positions y g . It is nevertheless important to notice that a constraint p has 
a direct influence on the low-level of active learning of SAGG-RIAC, and thus 
an indirect influence on the higher level. As using a constraint can require a 
more complex exploration process guided at the low-level, a more important 
number of iterations at this level can be required to reach a goal, which could 
have an influence on the global evolution of the performances of the learning 
process used by the higher-level of SAGG-RIAC. 

We define here the competence function C with the Euclidian distance 
D(y g ,yf), between the goal position and the final reached position yf, which 
is normalized by the starting distance D(y start ,y g ), where y start is the end- 
effector's starting position. This allows, for instance, to give a same competence 
level when considering a goal at 1cm from the origin position, which the robot 
approaches at 0.5cm and a goal at 1mm, which the robot approaches at 0.5mm. 



where C(y g ,yf ,y start) = if D(y sta rt,y g ) < e c (the goal is too close from 
the start position) and C(y g ,y f ,y start) = -1 if D(y g ,y f ) > D(y s t ar t,y g ) (the 
end-effector moved away from the goal). 



3.5 Addition of subgoals 

Computing local competence progress in subspaces/regions typically requires 
the reaching of numerous goals. Because reaching a goal can necessitate several 
micro-actions, and thus time, obtaining competence measures can be long. Also, 
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without biasing the learning process and as already explained in section 2J) we 
improve this mechanism by taking advantage of the Euclidian nature of Y: we 
increase the number of goals artificially, by adding subgoals on the pathway 
between the starting position and the goal, where competences are computed. 
Therefore, considering a starting state y 'start m Y ^ an d a self-generated goal y g , 
we define the set of I subgoals {yi,y 2 , —,yi} where yt = (i/l) x (y g - y star t), 
that have to be reached before attempting to reach the terminal goal y g . 

We also consider another way to increase the number of competence measures 
which is to take into consideration each experimented position of the end-effector 
as a goal reached with a maximal competence value. This will typically help 
the system to distinguish which regions are efficiently covered, and to discover 
new regions of interest. 



3.6 Active Goal Directed Exploration and Learning 

Here we propose a method inspired by the SSA algorithm to guide the system to 
learn on the pathway toward the selected goal position y g . This instantiation of 
the SAGG-RIAC architecture uses algorithm [3] and considers evolving contexts, 
as explained below. 



3.6.1 Reaching Phase 

The reaching phase deals with creating a pathway to the current goal position 
y g . This phase consists of determining, from the current position y c , an optimal 
micro-action which would guide the end-effector toward y g . For this purpose, 
the system computes the needed end-effector's displacement Ay next = v. ^Zy* y 
(where v is the velocity bounded by v max and ||^ c _y g || a normalized vector in 
direction of the goal), and performs the action Aa next = J + -Ay next , with J + , 
pseudo-inverse of the Jacobian estimated in the close vicinity of a and given 
the data collected^by the robot so far. After each action Ay next , we compute 
the error e = ||A$/ nea . t — Ay next \, and trigger the exploration phase in cases 
of a too high value e > e max > 0. s max is thus a parameter which has to be 
set depending on the range of error e that can be experienced, and will be set 
depending on a tolerance that can be conceded to allow reaching goal positions 
with the current learned data. While a too high value of e max will prevent 
exploring and learning new data (the system spending potentially too important 
amounts of time exploring around a same configuration and get trapped in local 
minima), too low values of e max will prevent an efficient local optimization. 



3.6.2 Exploration Phase 

This phase consists in performing q £ N small random explorative actions Aaj, 
around the current position a, where the variations can be derandomized such 
as in [45] . This allows the learning system to improve its regression model of 
the relationship (a, Aa) — ¥ Ay, in the close vicinity of a, which is needed to 
compute the inverse kinematics model around a. During both phases, a counter 
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is incremented for each micro-action and reset for each new goal. The timeout 
used to define a goal as unreached and to stop a reaching attempt uses this 
counter. A maximal quantity of micro-actions is fixed for each goal as directly 
proportional to the number of micro-action it requires to be reached. In the 
next experiments, the system is allowed to perform up to 1.5 times the distance 
between y start and y g before stopping the reaching attempt. 

3.7 Qualitative Results for a 15 DOF Simulated Arm 

In the simulated experiment introduced in this section, we consider the robotic 
arm presented Fig. [3] with 15 DOF, each limb of the robot having the same 
length (considering a 15 DOF arm corresponds to a problem of 32 continuous 
dimensions, with 30 dimensions in the actuator/state space and 2 dimensions 
in the goal/task space). We set the dimensions of the task space Y as bounded 
in intervals y g € [0; 150] x [—150; 150], where 50 units is the total length of the 
arm, which means that the arm covers less than 1/9 of the space Y where goals 
can be chosen (i.e. the majority of areas in the operational/task space are not 
reachable, which has to be self-discovered by the robot). We fix the number of 
subgoal per goal to 5, and the maximal number of elements inside a region before 
a split to g m ax = 50. We also set the desired velocity v — 2 units/micro-action, 
and the number of explorative actions q — 20. Moreover, we reset the arm to 
the rest position (a res t, y r est) (position displayed in Fig. [3]) every r — 1 reaching 
attempts. This allows reducing the initiation set and prevent the system from 
experimenting with too complex joint positions where the arm is folded, and 
where the Jacobian is more difficult to compute. Using a low value of r is an 
important characteristic for the beginning of the learning process. A too high 
value of T prevents learning rapidly how to achieve a maximal amount of goal 
position, due to the difficulty to reuse the previously learned data when the arm 
is folded in unknown positions. 

The bent character of the rest position is also useful to avoid to begin a 
micro-action close to a singularity like when the arm is totally unfolded. Also, 
in this experiment, we consider each experimented position of the end-effector 
as if it was a goal reached with the maximal competence level (these numerous 
positions are not displayed in the following figures in order to not overload the 
illustrations) . 

3.7.1 Evolution of Competences over Time 

Fig. [4] represents the whole distribution of self-generated goals and sub-goals 
selected by the higher-level of active learning module, and their corresponding 
competences after the execution of 30000 micro- act ions. The global shape of 
the distribution of points allows observing the large values of competence levels 
inside the reachable space and its close vicinity, and the global low competence 
inside the remaining space. 

The progressive increase of competences is displayed on Fig. [5] where we 
evaluate over time (indexed here by the number of goals self-generated) the 
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Figure 4: Competence values corresponding to the entire set of self-generated 
goals collected over an experiment of 30000 micro-actions on a 15 DOF arm. 
The heterogeneous set of competence values situated inside the reachable space 
illustrates the typical measures of competence that can be measured in this 
region over a whole experiment. For a visualization of the evolution of these 
competence values, see figure [5] 

global competence of the system to reach positions situated on a grid which 
covers the entire task space. From these estimations of competence, we can ex- 
tract two interesting phenomena: first of all, the two first subfigures, estimated 
after the self-generation of 42 and 83 goals, show that the system is, at the be- 
ginning of the exploration and learning process, competent to only attain areas 
situated close to the limits of the reachable space. Then, the 4 other subfigures 
show the progressive increase of competences inside the reachable space follow- 
ing an increasing radius whose the origin is situated around the end-effector rest 
position. 

The first observation is due to the reaching mechanism in itself, which, when 
possessing a few data acquired, does not allow the robot to experiment complex 
joint movements, but only simple ones which typically leads to the limits of 
the arm. The second phenomenon is due to the coupling of the lower-level of 
active learning inspired by SSA with the heuristic of returning to y re st every 
subsequent goals. Indeed, the necessity to be confident in the local model of 
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Figure 5: Evolution of competence values corresponding to self-generated goals 
collected during an experiment of 30000 micro-actions on a 15 DOF arm. Time 
is indexed by the number of self-generated goals. Higher values (dark red) corre- 
sponds to position that has been reached using learned data. (For interpretation 
of the references to color in this figure legend, the reader is referred to the web 
version of this article) 

the arm to shift toward new positions makes the system progressively explore 
the space, and resetting it to its rest position makes it progressively explore 
the space by beginning close to y res t- Finally, goal positions that are physically 
reachable but far from this radius typically present a low competence to be 
reached initially, before the radius spreads enough to reach them. 

3.7.2 Global Exploration over Time 

Fig. [6] shows histograms of goal positions self-generated during the execution 
of the 30000 micro-actions (only goals, not subgoals for an easy reading of the 
figure). Each subfigure corresponds to a specified time window indexed by the 
number of generated goals: the first one (upper-left) shows that, at the onset 
of learning, the system already focuses in a small area around the end-effector's 
rest position, and thus discriminates differences between a subpart of the reach- 
able area and the remaining space (the whole reachable zone being represented 
by the black half-circle on each subfigure of Fig. [6]). In the second subfigure, 
the system is, inversely, focusing almost only on regions of the space which are 
not reachable by the arm. This is due to the imprecise split of the space at 
this level of the exploration, which left very small reachable areas (which have 
already been reached with a high competence), at the edge inside each large un- 
reachable regions. This typically gives a high mean competence to each of these 
region when they are created. Then, due to the very large part of unreachable 
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Figure 6: Evolution of the distribution of self-generated goals displayed over 
time windows indexed by the number of performed goals, for an experiment of 
30000 micro-actions on a 15 DOF arm measuring 50 units. The black half-circle 
represents the contour of the area reachable by the arm. Higher values (dark 
red) corresponds to higher density of self-generated goals. (For interpretation 
of the references to color in this figure legend, the reader is referred to the web 
version of this article) 

areas, in comparison to reachable ones, the mean competence decreases over 
time. This brings interest to the region, thanks to the mathematical definition 
of the interest level, which, by using an absolute value, pushes the robot to- 
ward areas where the competence is decreasing. This complex process which 
allows driving the exploration in these kind of heterogeneous regions then allows 
dividing efficiently the task space into reachable and unreachable regions. 

Then, considering a global observation of subfigures 3 to 6, we can conclude 
that the system effectively autonomously discovers its own limits by focusing 
the goal self-generation inside reachable areas during the largest part of the 
exploration period. The system is indeed discovering that only a subpart is 
reachable due to the interest value becoming null in totally unreachable areas 
where the competence value is low. 

3.7.3 Exploration over Time inside Reachable Areas 

A more precise observation of subfigures 3 to 6 is presented in Fig. [8] where 
we can specifically observe the self-generated goals inside the reachable area. 
First, we can perceive that the system is originally focusing in an area around 
the end-effector's rest position y rest (shown by gray points in Fig. [8]). 

Then, it increases the radius of its exploration around y res t and focuses on 
areas further afield to the end-effector's rest position. Subfigures 2 and 3 shows 
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Figure 7: Evolution of the splitting of the task/goal space and creation of subre- 
gions indexed by the number of goals self-generated (without counting subgoals), 
for the experiment presented in Fig. [6] 
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Figure 8: Details of the evolution of the distribution of self-generated goals 
inside the reachable area for the experiment presented in Fig. [6j Gray points 
represent the end-effector rest position y rest . 

that the system explores new reachable parts corresponding to the right part 
close to its basis (subfigure 2), and then, the left part close to its basis (sub-figures 
3). 

Also, comparing the two first subfigures, and the two last ones, we observe 
a shift of the maximum exploration peak toward the arm basis. This is first 
linked with the loss of interest of self-generating goals around the end-effector's 
rest position. Indeed, because the system becomes highly efficient inside this 
region, the competence level becomes high and stationary over time, which leads 
to low interest values. At the same time, this phenomenon is also linked with 
the increase of competences in new reachable positions far from the end-effector 
rest position y res t, closer to its basis, which creates new regions of interest (see 
the four last subfigures of Fig. [5]). 
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3.7.4 Emergent Process 



The addition of subgoals and the consideration of each end-effector's position 
as a goal reached with the highest competence level have important influences 
on the learning process. If we look at traditional active learning algorithms 
which cannot deal with open-ended learning [301 B31 132] > as well as RIAC-like 
algorithms different from SAGG-RIAC [73 E3 El ES [Ml E] , we can notice that 
even if these techniques deal with avoiding excessive exploration in unlearnable 
or extremely complex areas, the learning process still has to begin by a period 
of random exploration of the whole space, to distinguish and extract which 
subparts are the most interesting according to the used definition of interest. 
Thanks to the addition of sub- goals and/or the consideration of every end- 
effector's position in SAGG-RIAC, in addition to exploring in the task space, 
we reduce the number of needed random global exploration, and improve the 
capability of the system to deal with large (i.e. when the volume of reachable 
space is small as compared to the volume of the whole space) task spaces. Using 
subgoals indeed creates a concentration of goals around the current end-effector's 
position, which progressively grows according to new experimented positions. 

Furthermore, the consideration of each end-effector's position for the estima- 
tion of competence allows discovering progressively which positions are reach- 
able with a high competence level, and gives a fast indication of first subregions 
where these high competences are situated. This increases the number of sub- 
regions close to the reachable areas and allows computing the interest values 
in the growing vicinity of the end-effector's experimented positions (see Fig. [7] 
where the progressive split of subregions in reachable areas is displayed). 

Therefore, these additions of competence measures allow the system to dis- 
cover and focus on areas where the competence is high in a very low number 
of goal self-generation, and tackle the typical problem of fast estimation and 
distinction of interesting areas. Nevertheless, this emergent process only helps 
to increase the number of feedbacks required by the goal self-generation mech- 
anism to split the space, and do not influence the low-level active learning. 
Then, the timeout which defines a goal as unreached during a single reaching 
attempt becomes crucial when considering high-volume task spaces with large 
unreachable parts as introduced in the following section. 



3.7.5 Robustness in High- Volume Task Spaces 

in the previous experiment, the timeout which describes a goal as not reached 
and stops a reaching attempt is defined as directly proportional to the number 
of micro-actions required to reach each goal. Practically, as introduced section 
3.6.2 we allowed the system to perform 1.5 times the distance between y s tart 



and y g before declaring a goal as not reached (including explorative movements). 

This timeout is efficient enough to learn efficiently by discriminating regions 
of different complexities in the middle-size space S' — [0; 150] x [—150; 150] 
considered in this experiment. Nevertheless, it can have an important influence 
on the SAGG learning process when considering extremely large task spaces 
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Figure 9: Histograms of self-generated goals displayed over time windows in- 
dexed by the number of performed goals, for an experiment of 30000 micro- 
actions on a 15 DOF arm, for a high- volume task space S = [—500; 500] x [0; 500], 
according to the reachable space contained in [—50; 50] x [0; 50] (the black half- 
circle represents the contour of the area reachable by the arm according to its 
length of 50 units). 



with small underlying reachable areas. For instance, if we consider a task space 
Y = [-500; 500] x [-0; 500] where only [-50; 50] x [0; 50] is reachable, the low- 
level of active learning will spend an extremely large number of iterations trying 
to reach each unreachable goal if this kind of timeout is used. 

Therefore, when considering such high- volume spaces, the definition of a new 
timeout becomes crucial. In Fig. [9j we demonstrate the high discriminating 
factor of SAGG-RIAC in such a task space (Y = [-500; 500] x [-0; 500]) when 
using a timeout which is not only based on the distance to the goal. This one has 
also been designed to stop a reaching attempt according to the following blocking 
criteria: let us consider a self-generated goal y g that the low-level exploration 
and reaching mechanisms try to reach. Then, if the system is not coming closer 
to the goal even after some low-level explorations, the exploration toward this 
precise goal stops. In a practical way, when w consecutive low-level explorations 
are triggered (typically w > 2) and thus no progress to the goal was made, 
we declare a goal as unreached, and compute the corresponding competence 
level. Using such a definition, the rapidity of discovering blocking situations 
will depend on both values of w and number of explorative actions q. Minimal 
values of these two parameters allows the fastest discoveries, but decrease the 
quality of the low-level exploration mechanism when exploring reachable spaces 
(in the experiment presented in Fig. p^we use q — 5 and w = 3). 
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3.7.6 Conclusion of Qualitative Results 

When considering low-level mechanisms allowing an efficient progressive learn- 
ing, the SAGG-RIAC algorithm is capable to discriminate very efficiently reach- 
able areas in such high- volume spaces. Then, it is also able to drive a progressive 
self-generation of goals through reachable subspaces of progressively growing 
complexities of reachability. 

In this experiment, the reachable region in the task space was convex and 
with no obstacles. Yet, as we will see in the fishing experiment below, SAGG- 
RIAC is capable of identifying correctly its zones of reachability, given a low- level 
optimization algorithm, even if there are "holes" or obstacles: goals initially 
generated in unreachable positions or in positions for which obstacles prevent 
their reaching provide a low level of competence progress, and thus the system 
stops trying to reach them. It is also possible to imagine that some given self- 
generated goals might be reachable only by an action policy going around an 
obstacle. Such a capability is not a property of the SAGG-RIAC architecture 
by itself, but a property of the optimization algorithm, and action represen- 
tation, that is used at the low-level goal-directed mechanism. In the present 
experiment, low- level optimization was a simple one only considering action 
policies going in a straight line to the goal. Yet, if one would have used more 
complex optimization leveraging continuous domain planning techniques (e.g. 
|121j ). the zones of reachability would be increased if obstacles are introduced 
since the low-level system could learn to go around them. 

3.8 Quantitative Results for Experiments 
with Task Spaces of Different Sizes 

In the following evaluation, we consider the same robotic system than previously 
described (15DOF arm of 50 units) and design different experiments. For each 
one, we estimate the efficiency of the inverse model learned by testing how it 
allows in average the robot to reach positions selected inside a test database of 
100 reachable positions (uniformly distributed in the reachable area and inde- 
pendent from the exploration of the robot). We will also compare SAGG-RIAC 
to three other types of exploration techniques: 

1. SAGG-RANDOM, where goals are chosen randomly (higher-level of active 
learning (RIAC) disabled) 

2. ACTUATOR-RANDOM, where small random micro-actions Aq are exe- 
cuted. This method corresponds to classical random motor babbling. 

3. ACTUATOR- RIAC, which corresponds to the original RIAC algorithm 
that uses the decrease in prediction errors (a, Aa) — > Ax to compute an 
interest value and split the space (a, Aa). 

Also, to be comparable to SAGG-RIAC, each ACTUATOR technique will have 
the position of the arm reset to the rest position every max micro-actions, max 
being the number of micro-actions needed to reach the more distant reachable 
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position, max is proportional to the desired velocity which is here of v = 2 
units/micro- action as well as the size of the task space (this will explain the 
different results of each ACTUATOR methods when used with task spaces of 
different sizes). In every graph, we present statistical results obtained after 
launching the same experiment with different random seeds 15 times. 



3.8.1 Exploration in the Reachable Space 

The first quantitative experiment is designed to compare the quality of inverse 
models learned using babbling in the task/operational space (i.e. using goals), 
instead of more traditional motor babbling heuristics executed in the configu- 
ration/actuator space. We still consider a n—15 DOF arm of 50 units, also, 
to be suited for the first study, dimensions of Y will be bounded in intervals 
y g E [0; 50] x [—50; 50] which means that the arm can reach almost all the space 
Y where goals can be chosen (the limits of reachability are thus almost given 
to the robot). In this experiment, we fix q = 20 for the SAGG methods and 
use a timeout only relative to the distance to the current goal (a end-effector 
movement of 1.5 times the one needed is allowed). 



Fig. 10 shows the evolution of the capability of the system to reach the 100 
test goals using the inverse model learned by each technique, starting from the 
rest position. This capability is computed using the mean Euclidian distance 
between the goal and the final state of a reaching attempt. 

Globally, these results show that in order to learn inverse kinematics of 
this highly- redundant arm, exploration in the goal/operational space is signif- 
icantly more efficient than exploration in the actuator space using either ran- 
dom exploration or RIAC-like active learning. Moreover, better performances 
of ACTUATOR-RANDOM compared to ACTUATOR-RIAC emphasizes that 
the original version of RIAC has not been designed for the efficient learning 
of inverse models of highly-redundant systems (high-dimension in the actuator 
space). 

Focusing on the evaluation of the two mechanisms which use SAGG, we can 
also make the important observation that SAGG-RIAC is here more efficient 
than SAGG-RANDOM when considering a system which already knows its own 
limits of reachability. More precisely, we observe both increase in learning speed 
and final generalization performances (this results resonates with results from 
more classic active learning, see [29]). These improvement signifies that SAGG- 
RIAC is efficiently able to progressively discriminate and focus on areas which 
bring the highest informational amount (i.e. areas which have not been visited 
enough). It brings to the learning system more useful data to create an efficient 
inverse model, contrarily to the SAGG-RANDOM approach which continues to 
select goals in already efficiently reached areas. 



3.8.2 Robustness in Large Task Spaces 

in the following experiment, we would like to test the capability of SAGG-RIAC 
to focus on reachable areas when facing high volume task spaces (will call this 
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Figure 10: Evolution of mean distances between the goal and the end effector 
after reaching attempts over an independently randomly generated set of test 
goals. Here SAGG-RIAC and SAGG- RANDOM are only allowed to choose goals 
within Y = [0; 50] x [—50; 50] (i.e. most eligible goals are physically reachable). 
Standard deviations are computed over 15 experiments at the same instants for 
each curve, and shifted in graphs for an easy reading. 



Therefore, we will here consider 



11 shows the learning efficiency 



phenomenon the discrimination capability), 
a task space Y = [0; 500] x [-500; 500]. Fig. 
of SAGG-RIAC using the timeout with blocking criteria as described in the 
section |3.7.5| This allows to test the quantitative aspect of the discrimination 
capability of SAGG-RIAC and its comparison with the three other techniques 
when facing high volume task spaces where only small subparts are reachable. 
As Fig. 11 shows, SAGG-RIAC is here the only method able to drive an efficient 
learning in such a space. SAGG- RANDOM actually spends the majority of the 
time trying to reach unreachable positions. Also, the size of the task space 
has an influence on the two ACTUATOR algorithms if we compare results in 
Y = [0; 50] x [-50; 50] introduced Fig. [To| and in Y = [0; 500] x [-500; 500] 
introduced Fig. 
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This is due to the value max of micro-actions performed 
by ACTUATOR methods which is proportional to the size of the task space as 
explained section 3.8 Results considering the space Y = [0; 500] x [—500; 500] 
seems more efficient for these methods, where the value of max is higher than 
in Y = [0; 50] x [—50; 50]. An increase of max thus allows these methods to 
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Figure 11: Evolution of mean distances between the goal and end effector after 
reaching attempts over an independently randomly generated set of test goals, 
averaged over 15 experiments. Here SAGG-RIAC and SAGG-RANDOM are 
allowed to choose goals within a large space corresponding to the one in Fig. 
[9j define as Y = [0; 500] x [—500; 500] (i.e. most eligible goals are physically 
unreachable). 

explore more efficiently the reachable space whose exploration is limited when 
considering a too low value of max. 

3.8.3 Robustness in Very Large Task Spaces 

Finally, we test the robustness of SAGG-RIAC in task spaces larger than in 
the previous section. Fig. [12] shows the behavior of SAGG-RIAC when used 
with task spaces of different sizes, from 1 to 900 times the size of the reachable 
space, and compare these results with a random exploration in the actuator 
space when the value of max is fixed as when Y = [0; 500] x [—500; 500]. We 
can notice here that, although the high discriminative capacity of SAGG-RIAC 
in large spaces such as Y = [0; 500] x [—500; 500], as shown previously, the 
performances of this technique decrease when the size of the considered task 
space increases. Therefore, we can observe that SAGG-RIAC obtains better re- 
sults than ACTUATOR-RANDOM since 5000 micro-actions when considering 
spaces smaller than Y = [0; 500] x [—500; 500]. Then, this method shows bet- 
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ter results than ACTUATOR-RANDOM only after 10000 micro-actions when 
considering the space Y — [0; 500] x [—500; 500]. And finally, this one becomes 
less efficient than ACTUATOR-RANDOM when the considered space increases 
in comparison to the reachable space, as shown by results when considering 
spaces Y = [0; 1000] x [-1000; 1000] and Y = [0; 1500] x [-1500; 1500]. These 
results clearly show that SAGG-RIAC is robust in spaces up to 100 times larger 
than the reachable space, but has some difficulties to explore even larger spaces. 
Therefore, despite the fact that SAGG-RIAC is very efficient in large spaces, 
it seems that the challenge of autonomous exploration in un-prepared spaces 
can not be totally resolved by this algorithm, a human supervisor being still 
necessary to define a set of (even very approximate) limits for the task space. 
As it will be emphasized in the perspective of this work, some complementary 
techniques should be used in order to bring robustness to such spaces, such as 
mechanisms inspired by the notion of maturational constraints which are able 
to fix limits on the task space since the beginning of the exploration process. 




Figure 12: Quantitative results of SAGG-RIAC when used with task spaces of 
different sizes and comparison with ACTUATOR-RANDOM. 



3.9 Quantitative Results for Experiments 

with Arm of Different Number of DOF and Geome- 
tries 

In every experiment, we set the dimensions of Y as bounded by the intervals 
y g E [0; 150] x [—150; 150], where 50 units is the total length of the arm, which 
means that the arm covers less than 1 /9 of the space Y where goals can be chosen 
(i.e. the majority of areas in the operational/task space are not reachable, which 
has to be discovered by the robot). 
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Figure 13: Evolution of mean distances between the goal and end effector after 
reaching attempts over an independently randomly generated set of test goals, 
averaged over 15 experimentations. Here SAGG-RIAC and SAGG-random are 
only allowed to choose goals within Y = [0; 150] x [—150; 150] (i.e. the set of 
reachable goals is only a small subset of eligible goals). 



For each experiment, we set the desired velocity v = 0.5 units/micro-action, 
and the number of explorative actions q = 20. Moreover, we reset the arm to 
the rest position (a rest , y res t) every r = 2 reaching attempts, which increases 
the complexity of the reaching process. 

We present a series of experiments aiming to test the robustness of SAGG- 
RIAC in arm setups with different shapes and numbers of degrees-of-freedom. 
Performed tests used 7, 15, and 30 DOF arms whose each limb has either the 
same length or a decreasing length depending on its distance from the arm's 
base (we use the golden number to specify the relative size of each part, taking 
inspiration from the architecture of human limbs). These experiments permit 
testing the efficiency of the algorithm for highly redundant systems (considering 
a 30 DOF arm corresponds to a problem of 62 continuous dimensions, with 60 
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dimensions in the actuator/state space and 2 dimensions in the goal/task space), 
and different morphologies. 

Also, to stress the capability of the system to make the robot self-discover 
its own limits, we remove the consideration of each end-effector position exper- 
imented as a goal reached with the highest level of competence (see 3.7.4). In 
these experiments, the competence level is therefore evaluated only for goals 
and subgoals. We fix q = 100, and compute tests of inverse models over 200000 
micro- actions. 



3.9.1 Quantitative Results 

Fig. [13] illustrates the performances of the learned inverse models when used 
to reach goals from an independent test database and evolving along with the 
number of experimented micro-actions. First, we can globally observe the slower 
decreasing velocity (over the number of micro- actions) of SAGG- RANDOM and 
SAGG-RIAC, compared to the previous experiment, which is due to the higher 
value of q and the removed consideration of every end-effector position. Graphs 
on the first line of Fig. [13] present the reaching errors of 7, 15 and 30 DOF 
arms with decreasing lengths. The first subfigure shows that when consid- 
ering 7 DOF, which is a relatively low number of degrees of freedom, SAGG- 
RANDOM is not the second more efficient algorithm. Indeed, the ACTUATOR- 
RANDOM method is here more efficient than SAGG-RANDOM after 25000 
micro-actions and is then stabilized, while SAGG-RANDOM is progressively 
decreasing, reaching the same level as ACTUATOR-RANDOM at the end of 
the experiment. This is due to the high focalization of SAGG-RANDOM outside 
the reachable area, which leads to numerous explorations toward unreachable 
positions. As shown also in this subfigure, adding the RIAC active component to 
SAGG efficiently improves the learning capabilities of the system; SAGG-RIAC 
reaching errors were indeed the lowest for this 7 DOF system. 

Experiments with 15 DOF and 30 DOF shows that both SAGG methods 
are here more efficient than actuator methods, SAGG-RIAC showing a signifi- 
cant improvement compared to every other algorithm (for 15DOF, the level of 
significance is p = 0.002 at the end of the experiment (200000 micro-actions)). 

Experiments presented with 7, 15 and 30 DOF arms where each limb has 
the same length show the same kind of results. The 7 DOF experiment shows 
that ACTUATOR-RANDOM can be more efficient than SAGG-RANDOM, and 
that the addition of RIAC allows obtaining a significant improvement in this 
case, but also when considering 15 and 30 DOF. 



3.9.2 Conclusion of Quantitative Results 

Globally, quantitative results presented here emphasize the high efficiency and 
robustness of SAGG-RIAC when carried out with highly redundant robotic se- 
tups of different morphologies, compared to more traditional approaches which 
explore in the actuator (input) space. They also showed that random explo- 
ration in the goal (output) space can be very efficient when used in high- 



3G 



I to 29 Goals 



29 to 57 Goals 





















1 





/I 



I 13 to 141 Goals 141 to 169 Goals 



Observation of the Camera (Goal Space) 




Figure 14: Histograms of self-generated goals displayed over time windows in- 
dexed by the number of performed goals, for an experiment of 10000 micro- 
actions on a real 8 DOF arm. Each histogram represents the surface covered by 
the camera, which here defines the task space. 



dimensional systems, even when considering a task space more than 9 times 
larger than the reachable subspace. These results therefore indicate the high 
potential of competence based motor learning for IK learning in highly- 
redundant robots. 

3.10 Qualitative Results for a Real 8 DOF Arm 

In this section, we test the robustness of the algorithm in a qualitative point of 
view when considering a real robotic setup (not simulated) which corresponds 
to the simulation presented above: we use a 8 DOF arm controlled in position. 
Also, helping to test the robustness of our method, we use low quality motors 
whose averaged noise is 20% for each movement. The fixed task space corre- 
sponds to the whole surface observable by a camera fixed on top of the robot, 
which is more than three times larger than the reachable space (see the left part 



of Fig. 14). In order to allow the camera to distinguish the end-effector of the 
arm and to create a visual referent framework on the 2D surface, we used visual 
tags and the software ARToolKit Tracker [8"T] . 



Fig. 14 (right part) shows histograms of self-generated goals displayed over 
sliding time windows indexed by the number of performed goals (without count- 
ing subgoals) for an experiment of 10000 micro- act ions. We can observe that the 
algorithm manages to discover the limits of the reachable area and drives the 
exploration inside after the goal 57. Then, the system continues to focus on the 
reachable space until the end of the experimentation, alternating between differ- 
ent areas inside. More precisely, we can notice while comparing the bottom-left 
subfigurc to the two positioned on the second line, that the system seems to 
concentrate only after some time on the areas situated close to its basis, and 
therefore more difficult to reach. The progressive increase of the complexity of 
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positions explored which appeared in simulation therefore also happens here. 
Finally, the last subfigure shows that the system continues its exploration to- 
ward an area more central of the reachable part. This is due to the high level of 
noise of the motor control: while the system is originally not very robust in this 
part of the space, an improvement of the generalization capacity of the learning 
algorithm allows obtaining an increase of competences in already visited areas 
of the task space. 

This experiment shows the efficiency of the SAGG-RIAC architecture to 
drive the learning process in real noisy robotic setups with only a few iterations, 
as well as its capacity to still control the complexity of the exploration when 
considering highly-redundant systems. 



4 Experimental Setup 2: Learning Omnidirec- 
tional Quadruped Locomotion with Motor Syn- 
ergies 

Sometimes stemming from pre-wired neuronal structures (e.g. central pattern 
generators [44J, [73j 09] ) , motor synergies are defined as the coherent activations 
(in space or time) of a group of muscles. They have been proposed as building 
blocks simplifying the scaffolding of motor behaviors because allowing the re- 
duction of the number of parameters needed to represent complex movements 
pHl 1551 fT51 1120] . Described as crucial for the development of motor abilities, 
they can be seen as encoding an unconscious continuous control of muscles which 
simplifies the complexity of the learning process: learning complex tasks using 
parameterized motor synergies (such as walking, or swimming) indeed corre- 
sponds to the tuning of relatively low-dimensional (but yet which can have a 
few dozen dimensions) high-level control parameters, compared to the impor- 
tant number of degrees of freedom which have to be controlled (thousand in the 
human body, see [16]). 



4.1 Formalization 

In the two following experiments, wc simplify the learning process by using 
such parameterized motor synergies controlling amplitude, phase, and velocity 
of Central Pattern Generators (CPGs). Mathematically, using motor synergies 
simplifies the description of the considered robotic system. In the framework 



introduced above (section 2.2) we defined our system as being represented by 
the relationship (s, a) — ¥ y, where for a given configuration s £ S, & sequence of 
actions a — {ai, a n } € A allows a transition toward y G Y . 

In the current framework we consider the sequence of actions as being gen- 
erated directly by parameterized motor synergies ng, which means that the 
sequence of actions is directly encoded and controlled (using feedbacks internal 
to the synergy) by setting parameters specified at the beginning of an action. 
For instance, in the experiment described in this section, we define a synergy as 




Figure 15: 12 degrees-of-freedom quadruped controlled using motor synergies 
parameterized with 24 values : 12 for the amplitudes and 12 others for the 
phases of a sinusoid tracked by each motor. Experiments consider a task space 
u,v,a which corresponds to the 2D position and orientation of the quadruped. 



a set of parameterized sinusoids (one on each joint) that a motor joint has to 
track with a low-level pre-programmed PID-like controller. Eventually, motor 
synergies can be seen as a way to encapsulate the low-level generation of se- 
quences of micro-actions, allowing the system to directly focus on the learning 
of models (s, irg) —> y, with seS fixed (the rest position of the robot) and 9 a 
set of parameters controlling the synergy (we will remove the fixed context s in 
the next notations for a easier reading and only write irg — > y)- 



4.2 Robotic Setup 

In the following experiment, we consider a quadruped robot simulated using 
the Breve simulator [S3] (physics simulation is based on ODE). Each of its leg 
is composed of 2 joints, the first (closest to the robot's body) is controlled by 
two rotational DOF, and the second, one rotation (1 DOF). Each leg therefore 



consists of 3 DOF, the robot having in its totality 12 DOF (See Fig. 15). 

This robot is controlled using motor synergies pig whose parameters 9 £ W 1 
directly specify the phase and amplitude of each sinusoid which controls the pre- 
cise rotational value of each DOF over time. These synergies are parameterized 
using a set of 24 continuous values, 12 representing the phase ph of each joint, 
and the 12 others, the amplitude am; 9 — {p/ii,2,..,i2; ami, 2,.., 12} > where each 
joint i receives the command am x sin(ujt+ph), with u) a fixed frequency. Each 
experimentation consists of launching a motor synergy ixg for a fixed amount 
of time, starting from a fixed position. After this time period, the resulting 
position yt of the robot is extracted into 3 dimensions: its position (u,v), and 
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its rotation <f>. The correspondence 9 — > (u, v, <fi) is then kept in memory as a 
learning exemplar. 

The three dimensions u, v, <f> are used to define the task space of the robot. 
Also, it is important to notice that precise areas reachable by the quadruped 
using these motor synergies cannot be estimated beforehand. In the following, 
we set the original dimensions of the task space to [—45; 45] x [—45; 45] x [— 27r; 2ir] 
on axis (it, v, </)), which was a priori larger than the reachable space. Then, after 
having carried out numerous experimentations, it appeared that this task space 
was actually more than 25 times the size of the area accessible by the robot (see 



red contours in Fig. 16 ) 



The implementation of our algorithm in such a robotic setup aims to test if 
the SAGG-RIAC driving method allows the robot to learn efficiently and accu- 
rately to attain a maximal amount of reachable positions, avoiding the selection 
of many goals inside regions which are unreachable, or that have previously been 
visited. 



4.3 Measure of competence 

In this experiment, we do not consider constraints p and only focus on reaching 
of the goal positions y g — (u g ,v g , 4> g ). In every iteration the robot is reset 



to a same configuration called the origin position (see Fig. 17). We define 
the competence function C using the Euclidian distance goal/robot's position 
D(y g , yf) after a reaching attempt, which is normalized by the original distance 



between the origin position y rigini an d the goal D(y origin ,y g ) (See Fig. 17). 

In this measure of competence, we compute the Euclidian distance using 
(u,v,(j)) where dimensions are rescaled in [0; 1]. Each dimension therefore has 
the same weight in the estimation of competence (an angle error of cf> = ^- is 
as important as an error u = ^ or v — ^). 



C(y g ,y f ,y start ) = 
where C(y g ,y f ,y sta rt) = if D(y start ,y g ) 

4.4 Active Goal Directed Exploration and Learning 

Reaching a goal y g necessitates the estimation of a motor synergy we i leading 
to this chosen state y g . Considering a single starting configuration for each 
experimentation, and motor synergies 7rg, the forward model which defines this 
system can be written as the following: 

6^(u,v,<f>) (9) 

Here, we have a direct relationship which only considers the 24 dimensional 
parameter vector 9 = {phi ,2,. .,12; o,mi2,..,i2} of the synergy as inputs of the sys- 
tem, and a position in (u, v, 4>) as output. We thus have a fixed context and use 



D (y g >yf) 

D(y s tart,y g ) 



(8) 



= 0. 
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Figure 16: Positions explored by the quadruped inside the task space u,v,(j> 
after 10000 experiments (running a motor synergy during a fixed amount of 
time), using different exploration mechanisms. Red lines represents estimated 
limits of reachability. (For interpretation of the references to color in this figure 
legend, the reader is referred to the web version of this article) 



here an instantiation of the SAGG-RIAC architecture with local optimization 
algorithm Alg. |4j detailed below. 
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Figure 17: Example of experimentation of the quadruped and illustration of be- 
ginning position, goal position (u g , v g ,(f> g ), and a corresponding reached position 
(uf,Vf,(f)f) whose value are used to compute the measure of competence. 



4.4.1 Reaching Phase 

The reaching phase deals with reusing the data already acquired and use local 
regression to compute an inverse model ((u,v,<p) — > 9)l in the locality L of 
the intended goal y g — (u g ,v g , (j) g ). In order to create such a local inverse 
model (numerous other solutions exist, such as [20 ) 18) l55l 1108] ). we extract the 
potentially more reliable data using the following method: 

We first extract from the learned data the set L of the I nearest neighbors 
of (u g ,v g ,4> g ) and then retrieve their corresponding motor synergies using an 
ANN method [70]: 

L = {{u,v,cj),9} 1 ,{u,v,(f),8}2,...,{u,v,<i),9}i} (10) 

Then, we consider the set M which contains I sets of m elements: 

!Mi : {{u, v, cj), 9}x, {u, v, (/), 9} 2 , {it, v, (j), 9} m } 1 
M 2 :{{u,v,(t>,6}i,{u,v,<p,0}2,...,{u,v,<p,9} m } 2 > 
Mi : {{u, v, <f>, 0}i, {u, v, (j), 9} 2l {u, v, <j), 9} m } l _ 

where each set {{u, v, (j), 9}\, {u, v, <fi, 9} 2 , {u, v, <f>, 9} m \ i corresponds to the m 
nearest neighbors of each 0j, i € L, and their corresponding resulting position 
(u,v,4>). 

For each set {{w, v, <f>, 6*}i, {u, v, 4>, 9} 2 , {u, v, <ft, 9} m } i , we estimate the 
standard deviation a of the parameters of their motor synergies 9 : 



a{M 3 ) = a(9 j e{{u,v,cf>,9} li ..., m }) (12) 

Finally, we select the set Mk = {{u, v, </>, 9}\, {u, v, tf>, 9} 2 , {u, v, 4>, 9} m } 
inside M such that it minimizes the standard deviation of its synergies: 

Mk = argmirii a(Mi) (13) 
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From Mfc, we estimate a local linear inverse model ((it, v, 4>) — > 9) by using 
a pseudo-inverse as introduced in the reaching experiment, and use it to esti- 
mate the motor synergy parameters 9 g which correspond to the desired goal 

(Ug,V g ,(f) g ). 

4.4.2 Exploration Phase 

The system here continuously estimates the distance between the goal y g and 
already reached position y c which is the closest from the goal. If the reaching 
phase does not manage to make the system come closer to y g , i.e. D(y g ,y t ) > 
D(y g ,y c ), with y t as last effectively reached point in an attempt toward y g , the 
exploration phase is triggered. 

In this phase the system first considers the nearest neighbor y c — (it c , v c ,(f) c ) 
of the goal (u g ,v g ,4> g ) and gets the corresponding known synergy 9 C . Then, 
it adds a random noise rand{2A) to the 24 parameters {p/ii,2,..,i2, «^i,2,..,12}c 
of this synergy 9 C which is proportional to the Euclidian distance D(y g ,y c ). 
The next synergy t+1 = {pfti 2,..,i2j a m i,2,..,i2}t+i to experiment can thus be 
described using the following equation: 

a ( {pftl,2,..,12,<OTll,2,..,12}c \ z,^ 

t+1 + X.rand(24).D(y g ,y c ) J 

where rand(i) returns a vector of i random values in [— 1;1], A > and 
{p^i,2,..,i2) csmi i 2,..,i2}c the motor synergy which corresponds to y c . 

4.5 Qualitative Results 

Fig. [16] presents the positions explored by the quadruped inside the task space 
it, v, </> after 10000 experimentations (running of motor synergies during the same 
fixed amount of time) using the exploration mechanisms introduced previously. 
ACTUATOR-RANDOM and ACTUATOR-RIAC select parameters of motor 
synergies in this experiment, whereas SAGG-RANDOM and SAGG-RIAC self- 
generate goals (u,v,(j>). 

Comparing the two first exploration mechanisms (ACTUATOR-RANDOM 
and ACTUATOR-RIAC) we cannot distinguish any notable difference, the space 
explored appears similar and the extent of explored space on the (u, v) axis is 
comprised in the interval [—5; 5] for it and [—2.5; 2.5] for v on both graphs. 
Moreover, we notice that the difference between it and v scales is due to the 
inherent structure of the robot, which simplifies the way to go forward and 
backward rather than shifting left or right. 

Considering SAGG methods, it is important to note the difference between 
the reachable area and the task space. In Fig. [THJ red lines correspond to the 
estimated reachable area which is comprised of [—10; 10] x [—10; 10] x [— tt;tt], 
whereas the task space is much larger: [—45; 45] x [—45; 45] x [— 2n; 2n}. We are 
also able to notice the asymmetric aspect of its repartition according to the v 
axis, which is due to the decentered weight of the robot's head. 
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Reaching Error 




Number of Actions (time steps) 

Figure 18: Quantitative results for the quadruped measured using the reaching 
error over the number of experimentations. 

First, the SAGG-RANDOM method seems to slightly increase the space 
covered on the u and v axis compared to ACTUATOR methods, as shown by 
the higher concentration of positions explored in the interval [—5; —3] U [3; 5] 
of u. However, this change does not seem very important when comparing 
SAGG-RANDOM to these two algorithm. 

Second, SAGG-RIAC, contrary to SAGG-RANDOM, shows a large explo- 
ration range: the surface in u has almost twice as much coverage than using 
previous algorithms, and in v, up to three times; there is a maximum of 7.5 in 
v where the previous algorithms were at 2.5. These last results emphasize the 
capability of SAGG-RIAC to drive the learning process inside reachable areas 
which are not easily accessible (hardly discovered by chance). 

4.6 Quantitative Results 

In this section, we aim to test the efficiency of the learned forward/inverse mod- 
els to guide the quadruped to reach a set of goal positions from an independently 
generated test database. Here we consider a test database of 100 goals, gen- 
erated independantly and covering approximately uniformly the reachable part 
of the task space, and compute the distance between each goal attempted, and 
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the reached position. Fig. 18 shows performances of the 4 methods introduced 



previously. First of all, we can observe the higher efficiency of SAGG-RIAC 
compared to the other three methods which can be observed after only 1000 
iterations. The high decreasing velocity of the reaching error (in the number of 
experimentations) is due to the consideration of regions limited to a small num- 
ber of elements (30 in this experiment) . It allows creating a very high number 
of regions within a small interval of time, which helps the system to discover 
and focus on reachable regions and its surrounding area. 

ACTUATOR- RIAC shows slightly more efficient performances than ACTUATOR- 
RANDOM. Also, even if SAGG-RANDOM is less efficient than SAGG-RIAC, 
we can observe its highly decreasing reaching errors compared to ACTUATOR 
methods, which allows it to be significantly more efficient than these method 
when considered at 10000 iterations. Again, as in the previous experiment, we 
can also observe that SAGG-RIAC does not only allow to learn faster how to 
master the sensorimotor space, but that the asymptotic performances also seem 
to be better I3TJ1. 



4.7 Conclusion of Results for the Quadruped Experiment 

These experiments first emphasize the high efficiency of methods which drives 
the exploration of motor synergies in terms of their effects in the task space. As 
illustrated by qualitative results, SAGG methods, and especially SAGG-RIAC, 
allows driving the exploration in order to explore large spaces containing areas 
hardly discovered by chance, when limits of reachability are very difficult to 
predict. Then, quantitative results showed the capability of SAGG-RANDOM 
and SAGG-RIAC methods to learn inverse models efficiently when considering 
highly-redundant robotic systems controlled with motor synergies. 



5 Experimental Setup 3: Learning to Control a 
Fishing Rod with Motor Synergies 

5.1 Robotic Setup 

This experiments consists of having a robot learning to control a fishing rod 
(with a flexible wire) in order to attain certain positions of the float when it 
touches the water. This setup is simulated using the Breve simulator, such as 
in the previous experiment. The rod is fixed on a 4 DOF arm controlled with 
motor synergies which affect the velocity of each joint, and are parameterized 
by the values 9 — V2, 113, U4), Vi € [0; 1]. More precisely, for each experimen- 
tation of the robot we use a low-level pre-programmed PID controller which 
tracks the desired velocity i>i of each joint i during a fixed short amount of time 
(2 seconds), starting from a fixed rest position, until suddenly stopping the 
movement. During the movement, as well as a few second after, we monitor the 
3D position of the float in order to detect a potential contact with the water (a 
flat plane corresponding to the water level). If the water is touched, we extract 
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. ^flexible wirej y^^, Fishing Rod" ) 




Reachable Space 



Figure 19: 4 degrees-of-freedom arm with a fixed fishing rod at its extrem- 
ity. The arm is controlled using motor synergies which affect the velocity of 
each joint, and are parameterized by 4 values. Experiments consider a two- 
dimensional task space x, y which corresponds to the position of the float when 
touching the water after performing a movement. 



the 2D coordinates (x, y) of the float on the plane (if not, we do not consider 
this trial). These coordinates, as well as the parameters of the synergies will 
be used to describe the forward model of the system as (v±, V2, V3, V4) — > (x,y). 
Learning will thus be performed while recording each set v%, U3, 1*4), (x, y)}i 
as a learning exemplar. In such a sensorimotor space, studying the behavior of 
SAGG-RIAC is relevant according to the flexible aspect of the line, which makes 
this system very difficult to model analytically, because it is highly redundant 
and highly sensitive to small variations of inputs. In the following experiment, 
the task space will consist of a limited area of the water surface. We will con- 
sider the basis of the arm as fixed on the coordinates (0, 0), the limits of the task 
space will be fixed to [—3; 3] x [—3; 3] while the reachable region corresponds to 



a disk whose radius is 1, and can be contained in [—1; 1] x [— 1; 1] (see Fig. 19) 



5.2 Qualitative Results 

Fig. [20] shows histograms of the repartition of positions reached by the float 
on the water surface computed after 10000 "water touched" trials (a "water 
touched" trial corresponds to a reaching attempt where the float effectively 
touches the surface), after running ACTUATOR-RANDOM and SAGG- 
RIAC exploration processes. The point situated at the center corresponds to 



the base to which the arm handling the fishing rod is situated (see Fig. 19). 
While observing the two figures, we can note a repartition of positions sit- 
uated inside a disk, which radius delimits position reached when the line is 
maximally slack. Yet, the distribution of reached (and reachable) positions 
within this disk is both asymetrical among and between the two exploration 
processes. The asymetries on each figure are in fact reflecting the asymetries 
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of the robot setup (see Fig. 19 1: the geometry of the robot is not symmetric 
and its starting/rest configuration is also not symmetric. Coupled with the 
structure of motor primitives, this makes that the structure of the reachable 
positions is complex and asymetric, and this can be observed especially in the 
ACTUATOR-RANDOM sub-figure, since it shows the asymetric distribu- 
tion of float position reached when the parameters of the action primitives are 
sampled uniformly (and thus symmetrically). Comparing the two histograms, 
we note that SAGG-RIAC drives the exploration toward positions of the float 
not explored by ACTUATOR-RANDOM, such as the large part situated 
at the bottom of the reachable area. Thus, SAGG-RIAC drives here the ex- 
ploration toward more diverse regions of the space. SAGG-RIAC is therefore 
able to avoid spending large amounts of time exclusively guiding the exploration 
toward the same areas, as ACTUATOR-RANDOM does. Extended exper- 
imentation with this setup showed that the distribution of reached points with 
SAGG-RIAC (right sub-figure) corresponds closely to the actual whole reach- 
able space. Eventually, these qualitative results emphasize that SAGG-RIAC 
is able to drive the exploration process efficiently when carried out with highly 
redundant and complex robots with compliant/soft parts. 



ACTUATOR-Random 



SAGG-RIAC 




Figure 20: Histograms of positions reached by the float when entering in contact 
with the water in the fishing experiment, after 10000 contact float/water, using 
ACTUATOR-RANDOM and SAGG-RIAC exploration methods. 



5.3 Quantitative Results 

Fig. [21] shows the mean reaching errors obtained using ACTUATOR- RANDOM 
and SAGG-RIAC, statistically computed after 10 experiments with different 
random seeds. Here, the comparison of these two methods shows that SAGG- 
RIAC led to significantly more efficient results after 1000 successful trials. Also, 
after 6000 trials, we can observe a small increase in reaching errors of SAGG- 
RIAC. This phenomenon is due to the discovery of new motor synergies which 
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Figure 21: Quantitative results for the fishing experiment measured using the 
reaching error over the number of experimentations. 

led to already mastered goal positions. This discovered redundancy reduces the 
generalization capability for computing the inverse model for a small amount of 
time until these new parameters of motor synergy have been explored enough 
to disambiguate the invert model (i.e. two distinct local inverse models are well 
encoded and do not interfere). 

6 Conclusion and Future Work 

This paper introduced the Self-Adaptive Goal Generation architecture, SAGG- 
RIAC, for active learning of inverse models in robotics through intrinsically mo- 
tivated goal exploration. First, we demonstrated the high efficiency of learning 
inverse models by performing an exploration driven by the active self-generation 
of high-level goals in the parameterized task space instead of traditional motor 
babbling specified inside a low-level control space. Active exploration in the 
task space leverages the redundancy often characterizing sensorimotor robotic 
spaces: this strategy drives robots to learn a maximal amount of tasks (i.e. 
learn to generate in a controlled manner a maximal number of effects in the 
task space), instead of numerous ways to perform the same tasks (i.e. learn 
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many action policies to achieve the same effect in the task space). Coupling 
goal babbling and sophisticated intrinsically motivated active learning also al- 
lows a robot to perform efficient autonomous learning of its limits of reachability, 
and of inverse models with unknown high-dimensional body schemas of differ- 
ent architectures. Intrinsically motivated active learning was here driven by the 
active stochastic search of areas in the task space where competence progress 
is maximal. This also allowed emerging developmental trajectories by driving 
the robot to progressively focus and learn tasks of increasing complexities, while 
discovering its own limits of reachability, avoiding to spend much exploration 
time trying to perform impossible tasks. 

While we showed that such an approach could allow efficient learning when 
the action space was continuous and high-dimensional, the experiments per- 
formed here were assuming that a low-dimensional task space was initially pro- 
vided. It is frequent to have such low-dimensional task spaces for useful engi- 
neering problems in robotics, where one can assume that an engineer helps the 
robot learner by designing by hand the task space (including the choice of the 
variables and parameters specifying the task space). On the other hand, if one 
would like to use an architecture like SAGG-RIAC in a developmental frame- 
work, where one would not assume low-dimensional task spaces pre-specified to 
the robot, some additional mechanisms should be added to equip the robot with 
the following two related capabilities: 

• Find autonomously low-dimensional task spaces. Indeed, a too high di- 
mension of a task space would make the evaluation of "competence progress' 
suffer from the curse of dimensionality; 

• Explore actively multiple task spaces (potentially an open-ended number 
of task space), thus opening the possibility to learn fields of skills which 
may be of different kinds; 

There are several potential approaches that could be used to address these issues 
that include: 

• Mechanisms for higher-level stochastic generation of task spaces, 

and their active selection through global measures of competence progress, 
forming an architecture with three levels of active learning (active choice 
of a task space inside a space of tasks spaces, active choice of goals inside 
the chosen task space, and active choice of actions to learn to reach the 
chosen goal) would be a natural extension of the work presented in this 
article. 

• Social guidance and learning by interaction: social guidance mechanisms 
allowing a non-engineer human to drive the attention of a robot toward 
particular task spaces, through physical guidance PHET] or human-robot 
interfaces allowing the robot to be attracted toward particular dimen- 
sions of the environment (88j . may be introduced. Inverse reinforcement 
learning mechanisms, which are able to extract reward functions thanks 
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to examples of action policies could also be seen as a mean to infer in- 
teresting task spaces from human demonstrations |50) . Social guidance 
may also be used as a mechanism to bootstrap the evaluation of compe- 
tence progress, and the identification of zones of reachability, in very large 
or high-dimensional spaces such as shown in |71) . which presents an ap- 
proach to combine intrinsically motivated learning like SAGG-RIAC with 
techniques for learning by demonstration. 

• Maturational constraints: Although SAGG-RIAC highly simplifies the 
learning process by using goal babbling and drives it efficiently thanks to 
intrinsic motivations, learning still have to begin by a period of random 
exploration in order to discriminate unreachable areas as well as areas of 
differing interests. This becomes a problem when the volume of reachable 
areas in the task space is a lot smaller than the task space itself or when 
the task space becomes itself high-dimensional. An important direction 
for future work is to take inspiration from the maturational processes 
of infants which are constrained in their learning and development by 
numerous physiological and cognitive mechanisms such as the limitation 
of their sensorimotor apparatus, as well as the evolving capabilities of 
their brain [Tt)l I122L l2"2l 166| . For instance, infants have a reduced visual 
acuity which prevents them from accessing high visual frequencies as well 
as distinguishing distant objects. This acuity then progressively grows as 
the maturation process evolves. Using such constraints in synergy with 
goal babbling and intrinsic motivation, such as explored in |10j . would 
potentially allow to constrain and simplify further learning since the first 
actions of the robot [39l [42] , and could be crucial when considering life- 
long learning in unbounded task spaces. 
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Algorithm 1 The SAGG-RIAC Architecture 
S: State/Context space 
II: Space of paremeterized action policies irg 
Y: Space of parameterized tasks yi 

M: regression model of the forward mapping (S, II) — > Y 

M _1 : regression model of the inverse mapping (S,Y) — > II 

R: set of regions RjCF and corresponding measures interest^, 

input: thresholds £q\ £ m ax'i timeout 

input: rest position s res t £ S; reset value: r 

input: starting position s start € S 

input: number of explorative movements ijeN 

input: starting time: t 

input: q budget of physical experiments for goal-directed optimization 
loop 

Reset the system in the resting state (s s t a rt = s res t) every r iteration of 
the loop; 

Active Goal Self- Generation (high-level): 

Self-generate a goal y g £ Y using the mode(m £ [1; 2; 3]) with probability 
p m (see Section 2.4.4.) 

Active Goal-Directed Exploration and Learning (low-level): 

Let St represent the current context of the system 

if Made possible by the sensorimotor space then 

Compute a set of subgoals {yi,y%, ■■■,y n } € Y on the pathway toward 
y g ; (e.g. with a planning algorithm that takes s, M, M^ 1 and y g into 
account); 

else 

{yi,V2, -,j/n} = 0; 

end if 

for each yj in {y x , y 2 , ...,y n } U y g do 

while T y . < Ec & timeout not exceeded do 

Compute and execute an action/synergy ng j £ H using M _1 such that 
it targets yj, e.g. using techniques such as in [2"U1 IH1 1551 1108j : 
Get the resulting actually performed y~j and update M and M^ 1 with 
new data (s t ,9j,yj) 

Compute the competence ry. (see section 2.4.1.) 

UpdateRegions(R, y~j, Ty.); 

if experiment with evolving context then 

Goal-directed optimization of Oj to reach yj, with SSA like algo- 
rithm such as Algorithm |3j and given a budget of q allowed physical 
experiments; 

else 

Goal-directed optimization of Oj to reach yj such as algorithm |4j or 
alternatively algorithms such as [1121 l45l [79] , and given a budget of 
q allowed physical experiments; 
end if 
end while 

Compute the competence T yj (see section 2.4.1.) 
UpdateRegions(R, y j} T yj ); g2 
end for 
end loop 



Algorithm 2 Pseudo-Code of UpdateRegions 

input: R: : set of regions R; C Y and corresponding measures inter esU; 

input: y t : current goal 

input: T yt : competence measure for y t 

Let guax be the maximal number of elements inside a region 
Let C be a time window used to compute the interest 
Find the region R n in R such that yt G R n ', 
Let k = card(R n ) 

Add T yt j c in i? n , where k is an indice indicating the ordinal order in which 
Fj, t was added in the region as compared to other measures of competences 
in R n ; 

Compute the new value of inter est n of R n according to each r^,; G R n such 
that: 



/ \R n \- 



( \Rn\ 



\ 



E r fcl , 



interest n = 

if card(R n ) > g max then 

Split R n ; (see text, section 2.4.4) 
end if 



Algorithm 3 Example of Pseudo-Code for the Low-Level Goal-Directed Explo- 
ration with Evolving Context (used in the experimentation introduced section 
3~3| 

input: q is the budget of physical experiment allowed to the robot for local 
optimization; 

Update the current context s t — sj; {where Sj is the context after having 
performed irg . } 

if Inefficient (M yj, yj) then 
Local Exploration Phase: 

for i = 1 to q do 

Perform action policy irg i with 9i drawn randomly in the vicinity of 9j ; 
Measure the resulting yi and Sf, 
Update M and M" 1 with (s t) 0i,yi); 
Update the context s t = s,; 
Compute the competence T y . ; 
UpdateRegions(R, yi, T Vi ); 
end for 
end if 
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Algorithm 4 Example of Pseudo-Code for the Low-Level Goal-Directed Explo- 
ration with a Fixed or Resettable Context (used in the experiments introduced 
sections [|and[5| 

input: q is the budget of physical experiment allowed to the robot for local 
optimization; 

Reset the current context: St = s res t\ 
if Inefficient (M _1 , jjj, yj) then 
Local Exploration Phase: 

Initialize 8^ = 9j and y^ = yj and T yk = Y y . 
for i — 1 to q do 

Perform TTg i where 9i is drawn randomly in the vicinity of 

Observe the resulting yi\ 

Update M and M _1 with the resulting (s t ,0i,yi); 
Reset the current context: St — s res t; 
Compute the competence T y .; 
UpdateRegions(R, yi, T Vi ); 
if T Vi > T Vk then 
9k = ®i 

Vk = yt 

end if 
end for 
end if 
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